Unicode 5.1 ( Basic Multilingual Plane ) - Language Identification related problem

    PankajB
    wrote on last edited by
    #1

    Hi there. I am creating an application that takes a document as input and finds the total number of languages used to compose it. I am using the following page as a reference: http://en.wikipedia.org/wiki/Basic_Multilingual_Plane Here is a code snippet:

    FILE *fp;
    long unicode;
    long c;

    fp = fopen(argv[1], "r");
    if(!fp)
    {
        printf("File open failed\n");
        return 0;
    }

    printf("Input Unicode file: %s\n", argv[1]);
    c = fgetc(fp);
    c = fgetc(fp);

    while( (unicode = fgetc(fp)) != EOF)
    {
        long unicode1 = fgetc(fp);
        unicode = (unicode1 << 8) | unicode;
        //(0000–FFFF): Basic Multilingual Plane (BMP).
        if (unicode >= 0x0000 /*0*/ && unicode <= 0x07FF/*2047*/)
        {
            if (unicode >= 0x0000 && unicode <= 0x007F) //Basic Latin (0000–007F)
            {
                unicode_set[Basic_Latin] = 1;
            }
            ....
            ....
            ....
        }
    }
    fclose(fp);

    I got this code from one of my previous projects, but I do not understand why we are doing unicode = (unicode1 << 8) | unicode; Also, this method does not correctly identify all the characters. FYI, I am using VS.NET 2008 with "Charset settings" set to "Use Unicode Character Set". Can you suggest any other way to find out which languages were used to compose a document? Thanks, PanB

      Stuart Dootson
      wrote on last edited by
      #2

      PankajB wrote:

      unicode = (unicode1 << 8) | unicode;

      That combines the two bytes unicode and unicode1 into a single 16-bit wide character. It assumes that the wide characters have been stored in little-endian order, i.e. low byte first.
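To make the byte-order point concrete, here is a minimal sketch in C of the two ways a pair of bytes can be combined into a 16-bit value. The helper names (utf16_unit_le, utf16_unit_be, utf16_bom) are made up for this illustration and are not part of the original code. The two fgetc calls before the loop in the original presumably skip a two-byte byte order mark; the sketch inspects it instead of discarding it, which is an assumption about what the input file contains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian: the low byte comes first in the file.
   This is the same computation as (unicode1 << 8) | unicode. */
uint16_t utf16_unit_le(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Big-endian: the high byte comes first in the file. */
uint16_t utf16_unit_be(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Inspect the first two bytes of a buffer.
   Returns 1 for a UTF-16LE byte order mark (FF FE),
           2 for a UTF-16BE byte order mark (FE FF),
           0 if neither is present. */
int utf16_bom(const unsigned char *p, size_t n)
{
    if (p == NULL || n < 2)
        return 0;
    if (p[0] == 0xFF && p[1] == 0xFE)
        return 1;
    if (p[0] == 0xFE && p[1] == 0xFF)
        return 2;
    return 0;
}
```

If the file turns out to be big-endian, the original loop would swap the two bytes of every character, which would put most characters in the wrong range.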

      PankajB wrote:

      Also, this method is not able to correctly identify all the chars.

      Possibly because not all Unicode characters fit into 16 bits?
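For context: in UTF-16, code points above U+FFFF are stored as two 16-bit units, a high surrogate (D800 to DBFF) followed by a low surrogate (DC00 to DFFF). A loop that treats every 16-bit unit as a complete character will misclassify those. The following is a rough sketch of decoding one code point from a sequence of 16-bit units; the function name utf16_decode and the choice to return U+FFFD for an unpaired surrogate are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Decode one code point from units[0..n-1].
   *consumed is set to the number of 16-bit units used (1 or 2).
   An unpaired surrogate yields U+FFFD and consumes one unit. */
uint32_t utf16_decode(const uint16_t *units, size_t n, size_t *consumed)
{
    assert(units != NULL && consumed != NULL && n > 0);

    uint16_t u = units[0];
    *consumed = 1;

    /* Not a surrogate: the unit is the code point itself (BMP). */
    if (u < 0xD800 || u > 0xDFFF)
        return u;

    /* High surrogate followed by a low surrogate: combine the pair. */
    if (u <= 0xDBFF && n >= 2 && units[1] >= 0xDC00 && units[1] <= 0xDFFF) {
        *consumed = 2;
        return 0x10000u
             + (((uint32_t)u - 0xD800u) << 10)
             + ((uint32_t)units[1] - 0xDC00u);
    }

    /* Lone low surrogate, or high surrogate without a partner. */
    return REPLACEMENT_CHAR;
}
```

For instance, the unit pair D83D DE00 decodes to U+1F600, which lies outside the Basic Multilingual Plane and so would never match a range test limited to 0000 to FFFF.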

      Java, Basic, who cares - it's all a bunch of tree-hugging hippy cr*p

        PankajB
        wrote on last edited by
        #3

        Thanks for the reply, buddy. Can you please suggest a solution for the second problem mentioned above, i.e., that possibly not all Unicode characters fit into 16 bits?

          Stuart Dootson
          wrote on last edited by
          #4

          I'd suggest that you use a library (like, say, libiconv) to read the file and convert from whatever character encoding the file uses to a full Unicode encoding (e.g. UTF-32). Then process each of those 32-bit characters the way you do in your original code.
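As a rough sketch of that suggestion, the function below uses the iconv interface (iconv_open, iconv, iconv_close) to convert a UTF-16LE buffer into 32-bit code points. Several things here are assumptions: the function name utf16le_to_code_points is made up, the input is taken to be UTF-16LE with any byte order mark already skipped, the encoding names "UTF-16LE" and "UTF-32LE" are assumed to be supported by the installed iconv, and some iconv builds declare the input pointer as const char ** rather than char **, in which case the cast needs adjusting.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Convert in_bytes bytes of UTF-16LE into code points stored in out.
   Returns the number of code points written, or (size_t)-1 on failure
   (unsupported encoding, malformed input, or output buffer too small). */
size_t utf16le_to_code_points(const char *in, size_t in_bytes,
                              uint32_t *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open("UTF-32LE", "UTF-16LE"); /* to, from */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;   /* iconv advances this pointer */
    size_t src_left = in_bytes;
    char *dst = (char *)out;
    size_t dst_left = out_cap * sizeof(uint32_t);

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    size_t count = out_cap - dst_left / sizeof(uint32_t);

    /* The output bytes are little-endian; rebuild each value from its
       bytes so the result does not depend on the host byte order. */
    const unsigned char *bytes = (const unsigned char *)out;
    for (size_t i = 0; i < count; i++) {
        const unsigned char *b = bytes + 4 * i;
        out[i] = (uint32_t)b[0]
               | ((uint32_t)b[1] << 8)
               | ((uint32_t)b[2] << 16)
               | ((uint32_t)b[3] << 24);
    }
    return count;
}
```

Each element of out can then be range-tested against the Unicode block boundaries in the same way as the original loop, with the difference that characters outside the Basic Multilingual Plane arrive as single values. On some platforms the program has to be linked with -liconv.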

          Java, Basic, who cares - it's all a bunch of tree-hugging hippy cr*p
