Unicode 5.1 (Basic Multilingual Plane) - Language identification related problem

PankajB

Hi there. I am creating an application that takes a document as input and finds the total number of languages used to compose it. I am referring to the web link below to do this: http://en.wikipedia.org/wiki/Basic_Multilingual_Plane

Let me also share a code snippet with you guys...

FILE *fp;
long unicode;
long c;

/* Open in binary mode so the runtime does not translate any bytes. */
fp = fopen(argv[1], "rb");
if(!fp)
{
    printf("File open failed\n");
    return 0;
}

printf("Input Unicode file: %s\n", argv[1]);
/* Skip the first two bytes, assumed to be the byte order mark. */
c = fgetc(fp);
c = fgetc(fp);

while( (unicode = fgetc(fp)) != EOF)
{
    long unicode1 = fgetc(fp);
    if (unicode1 == EOF) /* odd number of bytes: incomplete pair */
        break;
    unicode = (unicode1 << 8) | unicode;
    //(0000–FFFF): Basic Multilingual Plane (BMP).
    if (unicode >= 0x0000 /*0*/ && unicode <= 0x07FF/*2047*/)
    {
        if (unicode >= 0x0000 && unicode <= 0x007F) //Basic Latin (0000–007F)
        {
            unicode_set[Basic_Latin] = 1;
        }
        ....
        ....
        ....
    }
}
fclose(fp);

I got this code from one of my previous projects, but I do not understand why we are doing unicode = (unicode1 << 8) | unicode;

Also, this method is not able to correctly identify all the chars.

Just FYI, I am using VS.NET 2008 with "Charset settings" as "Use Unicode Character Set". Please suggest any other way to find out which languages were used to compose a document.

Thanks
PanB

Stuart Dootson

PankajB wrote:

unicode = (unicode1 << 8) | unicode;

That combines the two bytes unicode and unicode1 into a single 16-bit wide character. It assumes that the wide characters have been stored in little-endian order, i.e. the low byte comes first in the file.
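
As a rough illustration of the shift-and-OR above, here is a minimal sketch that assembles a 16-bit value from two bytes in either byte order. The function names utf16_unit_le and utf16_unit_be are made up for this example and are not part of the original code.

```c
#include <stdint.h>

/* Hypothetical helpers: combine two bytes, given in file order,
   into one 16-bit code unit. */

/* Little-endian: the first byte in the file is the low byte. */
uint16_t utf16_unit_le(uint8_t first, uint8_t second)
{
    return (uint16_t)(((uint16_t)second << 8) | first);
}

/* Big-endian: the first byte in the file is the high byte. */
uint16_t utf16_unit_be(uint8_t first, uint8_t second)
{
    return (uint16_t)(((uint16_t)first << 8) | second);
}
```

For example, the bytes 3A 04 read as little-endian give 0x043A, which falls in the Cyrillic block (0400–04FF). The same two bytes read as big-endian give 0x3A04, so picking the wrong byte order puts every character in the wrong range.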

PankajB wrote:

Also, this method is not able to correctly identify all the chars.

Possibly because not all Unicode characters fit into 16 bits?
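
To make that concrete: assuming the file is UTF-16, characters outside the BMP are stored as two 16-bit units (a surrogate pair), so a loop that treats every 16-bit value as one character will misclassify them. A minimal sketch of recombining such a pair follows; the helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helpers for UTF-16 surrogate pairs. */

/* First unit of a pair lies in D800..DBFF. */
int is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* Second unit of a pair lies in DC00..DFFF. */
int is_low_surrogate(uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a valid high/low pair into one code point above U+FFFF. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000u + (((uint32_t)(high - 0xD800) << 10)
                       | (uint32_t)(low - 0xDC00));
}
```

For instance, the pair D83D DE00 combines to U+1F600, which no single 16-bit value can represent.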

Java, Basic, who cares - it's all a bunch of tree-hugging hippy cr*p

PankajB

Thanks for the reply, buddy. Can you please suggest a solution for the second problem mentioned above? i.e., "Possibly because not all Unicode characters fit into 16 bits?"

Stuart Dootson

I'd suggest that you use some library (like, say, libiconv) to read the file and do the conversion from whatever character encoding is used for the file to a full Unicode encoding (e.g. UTF-32). Then process each of those 32-bit characters the way you are in your original code.
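
Something along these lines might work as a starting point. It is only a sketch using the standard iconv interface (iconv_open, iconv, iconv_close), and it assumes the input is UTF-16LE, that the whole input is already in memory, and that the host is little-endian so the UTF-32LE output can be read directly as 32-bit integers. The function name utf16le_to_utf32 is made up for this example.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert UTF-16LE bytes to UTF-32 code points.
   Assumes a little-endian host, so UTF-32LE bytes map directly onto
   uint32_t values. Returns the number of code points written, or
   (size_t)-1 on any error. */
size_t utf16le_to_utf32(const char *in, size_t in_bytes,
                        uint32_t *out, size_t out_capacity)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in_ptr = (char *)in;       /* iconv takes a non-const pointer */
    char *out_ptr = (char *)out;
    size_t in_left = in_bytes;
    size_t out_left = out_capacity * sizeof(uint32_t);

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;           /* bad input or output buffer too small */

    /* Code points written = capacity minus the unused slots. */
    return out_capacity - out_left / sizeof(uint32_t);
}
```

Each element of out is then a full code point, so the existing range checks can be applied to it unchanged and extended past 0xFFFF where needed. With libiconv on Windows you would need to link against the library; a real program would also convert the file in chunks rather than all at once.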
