How can I read a Unicode text file character by character?

  • CPallini

    What kind of Unicode text file are you dealing with (e.g. UTF-8)? What do you mean by 'character' (e.g. a 'Unicode character' or a 'single byte')? :)

    If the Lord God Almighty had consulted me before embarking upon the Creation, I would have recommended something simpler. -- Alfonso the Wise, 13th Century King of Castile.
    This is going on my arrogant assumptions. You may have a superb reason why I'm completely wrong. -- Iain Clarke
    [My articles]

    Le rner
    wrote on last edited by
    #6

    CPallini wrote:

    What kind of Unicode text file are you dealing with (e.g. UTF-8)?
    What do you mean by 'character' (e.g. a 'Unicode character' or a 'single byte')?

    The file uses the Unicode encoding type, and by one character I mean a single byte.

      CPallini
      wrote on last edited by
      #7

      If you want to read a byte at a time then use, as already suggested, fgetc. :)
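
A minimal sketch of the byte-at-a-time approach, assuming the file is opened in binary mode ("rb") so that no newline translation takes place. The helper name read_bytes and the caller-supplied buffer are illustrative and do not come from the thread. ch is an int so that EOF can be told apart from a valid byte, and the loop tests the return value of fgetc rather than calling feof.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read up to max_bytes raw bytes from an open stream,
   one fgetc call per byte. Returns the number of bytes stored in out. */
size_t read_bytes(FILE *stream, unsigned char *out, size_t max_bytes)
{
    assert(stream != NULL && out != NULL);

    size_t count = 0;
    int ch; /* int, not char, so EOF is distinguishable from byte 0xFF */

    /* Check the capacity first so no byte is consumed once out is full. */
    while (count < max_bytes && (ch = fgetc(stream)) != EOF) {
        out[count++] = (unsigned char)ch;
    }
    return count;
}
```

Each value obtained this way is one byte of the encoded file, which is not necessarily one character of the text.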


      In testa che avete, signor di Ceprano?

      • Lost User

        Use fgetc()/fgetwc() or wcin. It's exactly the same process as reading non-Unicode.

        I must get a clever new signature for 2011.

        Le rner
        wrote on last edited by
        #8

        I'm using this now, but it isn't working.

        FILE *stream;
        char buffer[2];
        int kk, ch;

        // Open file to read line from:
        fopen_s( &stream, OpenFile, "r" );
        if( stream == NULL )
        {
            return;
            //exit( 0 );
        }

        // Read in first 80 characters and place them in "buffer":
        ch = fgetc( stream );
        for( kk = 0; ( kk < 1 ) && ( feof( stream ) == 0 ); kk++ )
        {
            buffer[kk] = (char)ch;
            ch = fgetc( stream );
        }

        // Add null to end string
        buffer[kk] = '\0';
        printf( "%s\n", buffer );
        fclose( stream );
        

        Here it is unable to open the file; stream is always NULL, which is why it returns.

          Lost User
          wrote on last edited by
          #9

          You are ignoring the return code from fopen_s so it's impossible to diagnose your problem. Use something like the following and look up the error code that you receive.

          errno_t errNum = fopen_s( &stream, OpenFile, "r" );
          if (errNum != 0)
          {
          // add some code here to display the error value
          // or set a breakpoint and inspect its contents
          }

          I would suggest you look at the fopen_s documentation for further guidance. Incidentally, the rest of your code does not seem to be set up to process Unicode data, which was the subject of your original question.
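
A portable sketch of the same idea, assuming standard fopen plus errno in place of the Microsoft-specific fopen_s (a substitution, not what the post uses). The helper name open_for_reading is hypothetical. POSIX systems and the Microsoft CRT set errno when fopen fails, although ISO C does not strictly require it.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open path for binary reading. On failure, return NULL,
   store the errno value in *err_out and print a readable message. */
FILE *open_for_reading(const char *path, int *err_out)
{
    assert(path != NULL && err_out != NULL);

    errno = 0;
    FILE *stream = fopen(path, "rb");
    if (stream == NULL) {
        *err_out = errno;
        fprintf(stderr, "cannot open '%s': %s\n", path, strerror(*err_out));
        return NULL;
    }
    *err_out = 0;
    return stream;
}
```

The printed message usually narrows the problem down quickly, for example a wrong path versus a permissions issue.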

            Albert Holguin
            wrote on last edited by
            #10

            I hope you know that Unicode is two bytes and ASCII is one byte...

              Emilio Garavaglia
              wrote on last edited by
              #11

              Unicode is actually a set of code points whose cardinality requires 21 bits. When encoded as a sequence of 1-byte units it is called UTF-8, and when encoded as a sequence of 2-byte units it is called UTF-16. In UTF-8 a code point takes from 1 to 4 bytes (and the encoding is identical to ASCII for code points between 0 and 127). In UTF-16 a code point takes 2 or 4 bytes (two for most Latin, Cyrillic and Greek characters, as well as many simplified Chinese ones). "Unicode == 2 bytes" is a misconception that originated when Windows introduced Unicode APIs using 16 bits because, at that time, the Unicode specification was not so wide. In fact, reading 2 bytes does not necessarily mean "reading a character".
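
To make the "1 to 4 bytes" point concrete, here is a simplified sketch of decoding one code point from UTF-8. The function name utf8_decode is made up for illustration, and the sketch does not reject overlong forms or surrogate values, which a strict decoder would.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one code point from the UTF-8 bytes s[0..len-1].
   Returns the number of bytes consumed (1 to 4), or 0 if the input is
   truncated or malformed. Simplified: overlong forms and surrogates
   are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t len, unsigned long *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0) return 0;

    size_t need;          /* total bytes in this sequence */
    unsigned long value;  /* payload bits collected so far */

    if (s[0] < 0x80)                { need = 1; value = s[0]; }        /* 0xxxxxxx */
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; value = s[0] & 0x1F; } /* 110xxxxx */
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; value = s[0] & 0x0F; } /* 1110xxxx */
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; value = s[0] & 0x07; } /* 11110xxx */
    else return 0;  /* continuation byte or invalid lead byte */

    if (len < need) return 0;

    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* must be 10xxxxxx */
        value = (value << 6) | (s[i] & 0x3F);
    }
    *cp = value;
    return need;
}
```

For example, the bytes 0xE2 0x82 0xAC decode to the single code point U+20AC (the euro sign), so three fgetc calls are needed to obtain that one character.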

              2 bugs found. > recompile ... 65534 bugs found. :doh:

                Emilio Garavaglia
                wrote on last edited by
                #12

                The idea that one character means a single byte is actually a misconception ... see here[^].

                  Albert Holguin
                  wrote on last edited by
                  #13

                  Unfortunately, I think it depends on which C/C++ standard and which OS you use. I'm pretty sure Windows defines Unicode as 16 bits... From their website: "Unicode: A fixed-width, 16-bit worldwide character encoding that was developed and is maintained and promoted by the Unicode Consortium, a nonprofit computer industry organization." http://msdn.microsoft.com/en-us/library/cc194793.aspx

                    Albert Holguin
                    wrote on last edited by
                    #14

                    Here's another reference from Microsoft: http://msdn.microsoft.com/en-us/library/2dax2h36.aspx

                      Emilio Garavaglia
                      wrote on last edited by
                      #15

                      I'm sorry for you and for Microsoft, but the one and only body entitled to say what Unicode was, is and will be is www.unicode.org. The page you linked is a real shame for Microsoft. A technical document like that should not be published without stating, on the page itself, the date when it was written (hey ... they talk about their amazing new Windows NT 3.5 ...), and this fault alone should disqualify M$ from any authority in the field.

                        Albert Holguin
                        wrote on last edited by
                        #16

                        So angry! :laugh: ...Similar articles can be found in the MS VS2010 area of MSDN... I don't do much with Unicode, so I haven't needed to worry about it...

                          Emilio Garavaglia
                          wrote on last edited by
                          #17

                          Nope. 1) Unicode is not a Microsoft product. What Unicode is, is defined by www.unicode.org. 2) Microsoft encodes Unicode in 16-bit units. That is a technique well defined by the Unicode standard itself, known as UTF-16. Essentially, every code point outside the range 0xD800-0xDFFF and no greater than 0xFFFF is encoded as itself. Every code point greater than 0xFFFF has 0x10000 subtracted from it, and the resulting 20-bit value is broken into two 10-bit chunks, OR-ed with 0xD800 and 0xDC00 respectively. The range 0xD800-0xDFFF is called the Unicode surrogate range and does not contain valid code points. So a single Unicode character with a code greater than 0xFFFF needs a sequence of two wchar_t to be represented (as happens for some CJK - Chinese, Japanese, Korean - characters).
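
The surrogate arithmetic can be sketched as follows. This is an illustration rather than code from any Microsoft or Unicode library, and the name utf16_encode is made up. Note that UTF-16 subtracts 0x10000 from the code point before splitting it into two 10-bit chunks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16 code units in out[0..1].
   Returns 1 or 2 (units written), or 0 if cp is not a valid scalar value
   (a surrogate, or above U+10FFFF). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* reserved surrogate range */
    if (cp > 0x10FFFF) return 0;

    if (cp <= 0xFFFF) {           /* Basic Multilingual Plane: coded as itself */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                               /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For instance, U+1F600 becomes the pair 0xD83D 0xDE00, so code that treats every 16-bit unit as a character would see two "characters" there.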

                            Albert Holguin
                            wrote on last edited by
                            #18

                            I certainly accept your point about the Unicode Consortium being the authority... no argument there! :)

                              malaugh
                              wrote on last edited by
                              #19

                              If the program is built as Unicode, you need to use _wfopen_s to open the file, and the file name (OpenFile) needs to be a wchar_t string, something like wchar_t Myfile[] = L"my_file.ext"; (note the L prefix; the mode string must be wide as well, e.g. L"r"). Then you should be able to read the characters with ch = fgetwc( stream ); where ch is declared as wint_t, the return type of fgetwc, so that it can be compared with WEOF.
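
As a portable alternative to fgetwc (not what the post above uses), the 16-bit units can also be assembled by hand from pairs of bytes. This sketch assumes the file is UTF-16 little-endian and is opened in binary mode; that is only an assumption about what the "Unicode" encoding type in the question means. The name read_utf16le_unit is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Read one 16-bit UTF-16LE code unit from a stream opened in binary mode.
   Returns the unit (0..0xFFFF), or -1 at end of file or on a dangling byte. */
long read_utf16le_unit(FILE *stream)
{
    assert(stream != NULL);

    int lo = fgetc(stream);  /* little-endian: low byte comes first */
    if (lo == EOF) return -1;
    int hi = fgetc(stream);
    if (hi == EOF) return -1;

    return ((long)hi << 8) | (long)lo;
}
```

If the file starts with a byte order mark, the first unit returned is 0xFEFF and can be skipped. A unit in the range 0xD800-0xDBFF is the first half of a surrogate pair, so the next unit has to be read as well before a complete character is available.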
