How can read a unicode text file as character by character?

  Lost User wrote:

    Use fgetc()/fgetwc() or wcin. It's exactly the same process as reading non-Unicode.

    I must get a clever new signature for 2011.

    Le rner wrote (#8, in reply to Lost User):

    I'm using this now, but it isn't working.

    FILE *stream;
    char buffer[2];
    int kk, ch;

       // Open file to read line from:
       fopen_s( &stream, OpenFile, "r" );
       if( stream == NULL )
       {
          return;
          //exit( 0 );
       }

       // Read in first 80 characters and place them in "buffer":
       ch = fgetc( stream );
       for( kk=0; (kk < 1 ) && ( feof( stream ) == 0 ); kk++ )
       {
          buffer[kk] = (char)ch;
          ch = fgetc( stream );
       }

       // Add null to end string
       buffer[kk] = '\0';
       printf( "%s\n", buffer );
       fclose( stream );

    Here it is unable to open the file: stream is always NULL, which is why it returns.

      Lost User wrote (#9, in reply to #8):

      You are ignoring the return code from fopen_s so it's impossible to diagnose your problem. Use something like the following and look up the error code that you receive.

      errno_t errNum = fopen_s( &stream, OpenFile, "r" );
      if (errNum != 0)
      {
          // add some code here to display the error value
          // or set a breakpoint and inspect its contents
      }

      I would suggest you look at the documentation for further guidance. Incidentally, the rest of your code does not seem to be set up to process Unicode data, which was the subject of your original question.
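For readers whose compiler does not ship fopen_s, the same "capture the error code and report it" idea can be sketched with standard fopen and errno. This is only an illustration: the helper name open_for_reading is hypothetical, and it relies on fopen setting errno on failure, which POSIX guarantees but the C standard does not.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: try to open `path` for reading in binary mode.
   On success, stores the stream in *out and returns 0.
   On failure, stores NULL in *out, prints the reason to stderr and
   returns a non-zero error code (errno if available, otherwise -1). */
int open_for_reading(const char *path, FILE **out)
{
    assert(path != NULL && out != NULL);

    errno = 0;
    *out = fopen(path, "rb");
    if (*out == NULL) {
        int err = errno;  /* save it before any other call can change it */
        fprintf(stderr, "cannot open '%s': %s (error %d)\n",
                path, strerror(err), err);
        return err != 0 ? err : -1;
    }
    return 0;
}
```

Printing the message at the point of failure usually shows at once whether the problem is a wrong path, a permission issue, or something else.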

      Le rner wrote:

        CPallini wrote:

        What kind of Unicode text file are you dealing with (e.g. UTF-8, etc.)?
        What do you mean by 'character' (e.g. 'Unicode character' or 'single byte')?

        I am using a file saved with the "Unicode" encoding type, and by one character I mean a single byte.

        Albert Holguin wrote (#10, in reply to Le rner's answer to CPallini):

        I hope you know that Unicode is two bytes and ASCII is one byte...

          Emilio Garavaglia wrote (#11, in reply to #10):

          Unicode is actually a set of code points whose range requires 21 bits. When encoded as a sequence of 1-byte units it is called UTF-8, and when encoded as a sequence of 2-byte units it is called UTF-16. In UTF-8, a code point takes from 1 to 4 bytes (and the encoding is identical to ASCII for code points between 0 and 127). In UTF-16, a code point takes 2 or 4 bytes (2 for most Latin, Cyrillic and Greek characters, as well as many simplified Chinese ones). "Unicode == 2 bytes" is a misconception that originated when Windows introduced its Unicode APIs using 16 bits, since at that time the Unicode specification was not so wide. Reading 2 bytes does not necessarily mean "reading a character".
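To make the "1 to 4 bytes" point concrete, here is a minimal sketch of a UTF-8 encoder in C. The function name utf8_encode is hypothetical, chosen only for this illustration; it is not taken from any post in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point `cp` as UTF-8 into `out`.
   Returns the number of bytes written (1 to 4), or 0 if `cp` is not
   a valid Unicode scalar value (a surrogate or above 0x10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp <= 0x7F) {                 /* ASCII range: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {               /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For example, 'A' (U+0041) stays a single byte, while the euro sign (U+20AC) becomes the three bytes E2 82 AC, so a byte-by-byte read of a UTF-8 file does not yield one character per byte.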

          2 bugs found. > recompile ... 65534 bugs found. :doh:

            Emilio Garavaglia wrote (#12, in reply to Le rner's answer to CPallini):

            This is actually a misconception ... see here.

              Albert Holguin wrote (#13, in reply to #11):

              Unfortunately, I think it depends on which C/C++ standard and which OS you are using. I'm pretty sure Windows defines Unicode as 16 bits... from their website: "Unicode: A fixed-width, 16-bit worldwide character encoding that was developed and is maintained and promoted by the Unicode Consortium, a nonprofit computer industry organization." http://msdn.microsoft.com/en-us/library/cc194793.aspx

                Albert Holguin wrote (#14, in reply to #12):

                Here's another reference to Microsoft: http://msdn.microsoft.com/en-us/library/2dax2h36.aspx

                  Emilio Garavaglia wrote (#15, in reply to #13):

                  I'm sorry for you and for Microsoft, but the one and only body entitled to say what Unicode was, is and will be is www.unicode.org. The page you linked is a real embarrassment for Microsoft. A technical document like that should not be published without stating, on the page itself, the date when it was written (hey ... they speak about their new amazing Windows NT 3.5 ...), and this fault alone should disqualify M$ from any authority in the field.

                    Albert Holguin wrote (#16, in reply to #15):

                    So angry! :laugh: ...similar articles can be found in the MS VS2010 area of MSDN... I don't do much with Unicode, so I haven't needed to worry about it...

                      Emilio Garavaglia wrote (#17, in reply to #14):

                      Nope. 1) Unicode is not a Microsoft product. What Unicode is is defined by www.unicode.org. 2) Microsoft encodes Unicode in 16-bit units. That technique is well defined by the Unicode standard itself and is known as UTF-16. Essentially, every code point up to 0xFFFF that is not in the range 0xD800-0xDFFF is encoded as itself. Every code point above 0xFFFF has 0x10000 subtracted from it and is then split into two 10-bit chunks, which are OR-ed with 0xD800 and 0xDC00 respectively. The range 0xD800-0xDFFF is called the "Unicode surrogates" and does not contain valid code points. So a single Unicode character with a code greater than 0xFFFF (as is the case for some CJK - Chinese, Japanese, Korean - characters) needs two wchar_t in sequence to be represented.
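The surrogate rule described above can be sketched in a few lines of C. The function name utf16_encode is hypothetical and used only for illustration; uint16_t stands in for the 16-bit wchar_t found on Windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point `cp` as UTF-16 code units.
   Returns the number of 16-bit units written (1 or 2), or 0 if `cp`
   is a surrogate value or lies above 0x10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp <= 0xFFFF) {               /* encoded as itself */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                    /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

With this sketch, U+1F600 comes out as the pair 0xD83D 0xDE00, which is why one 16-bit unit is not always one character.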

                        Albert Holguin wrote (#18, in reply to #17):

                        I certainly believe your point about the Unicode Consortium being the authority... no argument there! :)

                          malaugh wrote (#19, in reply to #8):

                          If your program is built as Unicode, you need to use _wfopen_s to open the file, and both the file name (OpenFile) and the mode string must be wide strings, something like wchar_t Myfile[] = L"my_file.ext"; with L"r" as the mode. You should then be able to read the characters with ch = fgetwc( stream ); where ch is declared as wint_t (the return type of fgetwc) so that it can be compared against WEOF.
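As a portable complement to the fgetwc approach, here is a minimal sketch that reads one Unicode character at a time using only fgetc. It assumes the file is UTF-16 little-endian (what Windows Notepad has traditionally labelled "Unicode") and that the stream was opened in binary mode ("rb"). The function names are hypothetical, and unpaired surrogates are simply reported as U+FFFD.

```c
#include <assert.h>
#include <stdio.h>

/* Read one little-endian 16-bit code unit.
   Returns -1 at end of file or if only half a unit is left. */
static long read_unit_le(FILE *fp)
{
    int lo = fgetc(fp);
    if (lo == EOF)
        return -1;
    int hi = fgetc(fp);
    if (hi == EOF)
        return -1;
    return (long)lo | ((long)hi << 8);
}

/* Hypothetical helper: read one code point from a UTF-16LE stream.
   Returns the code point, -1 at end of file, or 0xFFFD when a
   surrogate is not part of a valid pair. */
long read_utf16le_char(FILE *fp)
{
    assert(fp != NULL);

    long unit = read_unit_le(fp);
    if (unit == 0xFEFF && ftell(fp) == 2)   /* skip a byte order mark at the start */
        unit = read_unit_le(fp);
    if (unit < 0)
        return -1;

    if (unit >= 0xD800 && unit <= 0xDBFF) { /* high surrogate: need a low one */
        long low = read_unit_le(fp);
        if (low < 0xDC00 || low > 0xDFFF)
            return 0xFFFD;                  /* sketch: the unit just read is dropped */
        return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF)   /* low surrogate with no high one */
        return 0xFFFD;

    return unit;
}
```

A caller would open the file with fopen(path, "rb") and call read_utf16le_char in a loop until it returns -1. Because the bytes are decoded by hand, the result does not depend on the compiler's wchar_t size or on the current locale.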
