API for UNICODE

sandeepkavade

Hi all, is there any API to check whether a file contains Unicode, UTF-8 or ANSI characters?

Abhijeet Pathak

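The IsTextUnicode() function can be used to check whether a buffer is likely to contain Unicode (UTF-16) text. It relies on statistical tests, so treat the result as a guess. The following code might help you:

    #include <windows.h>
    #include <stdio.h>

    // Returns 2 if the file looks like Unicode, 1 if it does not, 0 on error.
    int IsUnicodeFile(const char* szFileName)
    {
        char l_szCharBuffer[80];

        // Open in binary mode so the bytes are read unchanged
        FILE* fpUnicode = fopen(szFileName, "rb");
        if (fpUnicode == NULL)
            return 0; // Unable to open file

        // Read up to 80 bytes and remember how many were actually read
        size_t nRead = fread(l_szCharBuffer, 1, sizeof(l_szCharBuffer), fpUnicode);
        fclose(fpUnicode);
        if (nRead == 0)
            return 0; // Empty file or read error

        if (IsTextUnicode(l_szCharBuffer, (int)nRead, NULL))
            return 2; // Text is probably Unicode (UTF-16)

        return 1; // Text is not UTF-16 (ANSI, ASCII or UTF-8)
    }

:)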

Nibu babu thomas

sandeepkavade wrote:

Is there any API to check whether a file contains Unicode, UTF-8 or ANSI characters?

The first few bytes of a file can indicate its encoding... If the first three bytes of a file are EF, BB and BF, the file is a UTF-8 file. If the first two bytes are FE and FF (or FF and FE for little-endian), the file is a Unicode (UTF-16) file.


sandeepkavade

Hi Thomas, I am very new to VC++. I would be really thankful if you could tell me what EF, BB and BF are, and how to determine them. Thanks in advance.

Nemanja Trifunovic

Nibu babu thomas wrote:

The first few bytes of a file can indicate its encoding... If the first three bytes of a file are EF, BB and BF, the file is a UTF-8 file. If the first two bytes are FE and FF (or FF and FE for little-endian), the file is a Unicode (UTF-16) file.

That's not a reliable way to determine whether a file contains Unicode. UTF-8 is not required to start with a byte-order mark, and files with UTF-16LE and UTF-16BE encodings are actually forbidden to start with it.
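
Since a UTF-8 file may carry no byte-order mark at all, one alternative is to test whether the bytes themselves form well-formed UTF-8. The following is only a minimal sketch of that idea, not the method of any poster in this thread or the API of any library. The name `LooksLikeUtf8` is hypothetical, and the function assumes the whole buffer is passed in, so a multi-byte sequence cut off at the end counts as invalid. A positive result is a hint rather than proof: plain ASCII text passes the test as well.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if the buffer is well-formed UTF-8.
bool LooksLikeUtf8(const unsigned char* data, std::size_t size)
{
    std::size_t i = 0;
    while (i < size) {
        const unsigned char lead = data[i];
        std::size_t extra = 0;       // continuation bytes expected after the lead byte
        unsigned long minValue = 0;  // smallest code point allowed for this length
        unsigned long cp = 0;        // decoded code point

        if (lead < 0x80) { ++i; continue; }  // single-byte (ASCII) character
        else if ((lead & 0xE0) == 0xC0) { extra = 1; minValue = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; minValue = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; minValue = 0x10000; cp = lead & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (size - i <= extra)
            return false;  // sequence is truncated

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = data[i + k];
            if ((c & 0xC0) != 0x80)
                return false;  // not a continuation byte (10xxxxxx)
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minValue) return false;                 // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point

        i += extra + 1;
    }
    return true;
}
```

A file saved in an ANSI code page with accented characters will usually fail this check, because a lone byte such as 0xE9 is not a complete UTF-8 sequence.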


Rage

These are hexadecimal numbers: 0xEF = 239, 0xBB = 187, ... Simply read the first bytes of the file and compare them to these numbers.
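
The comparison described above could be sketched roughly as follows. The names `Bom`, `DetectBom` and `DetectFileBom` are hypothetical and not part of any Windows API. The FF FE case for little-endian UTF-16 is an addition that the thread does not spell out. As pointed out earlier in the thread, a missing byte-order mark tells you nothing about the encoding, so treat this as a first check only.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical result type for this sketch.
enum class Bom { None, Utf8, Utf16BigEndian, Utf16LittleEndian };

// Compare the first bytes of a buffer against the known byte-order marks.
Bom DetectBom(const unsigned char* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Bom::Utf8;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Bom::Utf16BigEndian;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Bom::Utf16LittleEndian;
    return Bom::None;
}

// Read up to three bytes from the start of a file and classify them.
Bom DetectFileBom(const char* fileName)
{
    unsigned char head[3] = {0, 0, 0};

    // Binary mode, so the bytes are read exactly as stored on disk
    std::FILE* fp = std::fopen(fileName, "rb");
    if (fp == nullptr)
        return Bom::None;  // sketch only: an unreadable file is reported as "no BOM"

    const std::size_t n = std::fread(head, 1, sizeof(head), fp);
    std::fclose(fp);
    return DetectBom(head, n);
}
```

The buffer is declared as `unsigned char` on purpose: with plain `char`, a byte such as 0xEF may be read as a negative value and the comparison with 0xEF would fail.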


Ralf Lohmueller

Nemanja Trifunovic wrote:

UTF-8 is not required to start with a byte-order mark, and files with UTF-16LE and UTF-16BE encodings are actually forbidden to start with it.

Sorry, why are UTF-16 (little/big endian) files actually forbidden to start with it?