Code Project
How can I read a Unicode text file character by character?

  • P pix_programmer

    Use getch().

    Le rner
    wrote on last edited by
    #3

    how?

    • L Le rner

      Hi all, I have a Unicode file and I want to read it character by character, that is, only one character at a time. Please tell me how I can do this. Thanks in advance.

      CPallini
      wrote on last edited by
      #4

      What kind of Unicode text file are you dealing with (e.g. UTF-8, etc.)? What do you mean by 'character' (e.g. 'Unicode character' or 'single byte')? :)

      If the Lord God Almighty had consulted me before embarking upon the Creation, I would have recommended something simpler. -- Alfonso the Wise, 13th Century King of Castile.
      This is going on my arrogant assumptions. You may have a superb reason why I'm completely wrong. -- Iain Clarke
      [My articles]

      What is that on your head, Signor di Ceprano?

        Lost User
        wrote on last edited by
        #5

        Use fgetc()/fgetwc() or wcin. It's exactly the same process as reading non-Unicode.
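
A minimal sketch of the fgetwc() route, assuming the stream is wide-oriented and that the current locale's encoding matches the file (for UTF-8 input you would normally call setlocale(LC_ALL, "") under a UTF-8 locale before reading). The helper name read_wide_chars is hypothetical, not something from this thread.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: copies up to cap-1 wide characters from fp into out
   and null-terminates the result. Returns the number of characters stored. */
size_t read_wide_chars(FILE *fp, wchar_t *out, size_t cap)
{
    size_t n = 0;
    wint_t wc;                      /* wint_t, not wchar_t, so WEOF can be told apart */

    if (cap == 0)
        return 0;
    while (n + 1 < cap && (wc = fgetwc(fp)) != WEOF)
        out[n++] = (wchar_t)wc;     /* one wide character per fgetwc() call */
    out[n] = L'\0';
    return n;
}
```

What counts as one "character" here is whatever the runtime decodes into a single wchar_t, so the result depends on the platform and locale.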

        I must get a clever new signature for 2011.

          Le rner
          wrote on last edited by
          #6

          CPallini wrote:

          What kind of Unicode text file are you dealing with (e.g. UTF-8, etc.)?
          What do you mean by 'character' (e.g. 'Unicode character' or 'single byte')?

          The file is saved with the "Unicode" encoding type, and by one character I mean a single byte.

            CPallini
            wrote on last edited by
            #7

            If you want to read a byte at a time then use, as already suggested, fgetc. :)
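
A minimal sketch of reading one byte per fgetc() call. It assumes the file was opened in binary mode ("rb") so the runtime does not translate any bytes; read_bytes is a hypothetical helper name.

```c
#include <stdio.h>

/* Hypothetical helper: reads up to cap bytes from fp, one fgetc() call each.
   Returns the number of bytes stored in out. */
size_t read_bytes(FILE *fp, unsigned char *out, size_t cap)
{
    size_t n = 0;
    int ch;                     /* int, not char, so EOF stays distinguishable */

    while (n < cap && (ch = fgetc(fp)) != EOF)
        out[n++] = (unsigned char)ch;
    return n;
}
```

Testing the return value of fgetc() against EOF, rather than calling feof() before each read, avoids processing one bogus value at the end of the file.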

              Le rner
              wrote on last edited by
              #8

              I am using this now, but it is not successful.

              FILE *stream;
              char buffer[2];
              int kk, ch;

                 // Open file to read from:
                 fopen_s( &stream, OpenFile, "r" );
                 if( stream == NULL )
                 {
                    return ;
                    //exit( 0 );
                 }

                 // Read in the first character and place it in "buffer":
                 ch = fgetc( stream );
                 for( kk=0; (kk < 1 ) && ( feof( stream ) == 0 ); kk++ )
                 {
                    buffer[kk] = (char)ch;
                    ch = fgetc( stream );
                 }

                 // Add null to end string
                 buffer[kk] = '\0';
                 printf( "%s\n", buffer );
                 fclose( stream );

              Here it is unable to open the file and stream is always NULL, which is why it returns.

                Lost User
                wrote on last edited by
                #9

                You are ignoring the return code from fopen_s so it's impossible to diagnose your problem. Use something like the following and look up the error code that you receive.

                errno_t errNum = fopen_s( &stream, OpenFile, "r" );
                if (errNum != 0)
                {
                // add some code here to display the error value
                // or set a breakpoint and inspect its contents
                }

                I would suggest you look at the documentation for further guidance. Incidentally, the rest of your code does not seem to be set up to process Unicode data, which was the subject of your original question.
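
One way to display the error, sketched with the portable fopen() and errno instead of fopen_s() (a deliberate swap so the example compiles outside the Microsoft toolchain). The helper name open_for_reading is hypothetical, and relying on errno after a failed fopen() is a POSIX guarantee rather than an ISO C one.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: opens path for binary reading.
   Returns 0 and sets *out on success; otherwise prints the reason to stderr,
   leaves *out as NULL and returns a nonzero error code. */
int open_for_reading(const char *path, FILE **out)
{
    errno = 0;
    *out = fopen(path, "rb");
    if (*out != NULL)
        return 0;

    /* Assumption: errno was set by fopen (true on POSIX systems). */
    int err = (errno != 0) ? errno : -1;
    fprintf(stderr, "cannot open '%s': %s\n", path,
            err > 0 ? strerror(err) : "unknown error");
    return err;
}
```

With fopen_s() the same idea applies: pass its return value to strerror() or print it as a number and look it up.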

                  Albert Holguin
                  wrote on last edited by
                  #10

                  Hope you do know that unicode is two bytes and ascii is one byte...

                    Emilio Garavaglia
                    wrote on last edited by
                    #11

                    Unicode is actually a set of code points whose range requires 21 bits. When encoded as a sequence of 1-byte units it is called UTF-8, and when encoded as a sequence of 2-byte units it is called UTF-16. In UTF-8 a code point takes from 1 to 4 bytes (and the encoding is identical to ASCII for code points between 0 and 127). In UTF-16 a code point takes 2 or 4 bytes (two for most Latin, Cyrillic and Greek characters, as well as many simplified Chinese ones). "Unicode == 2 bytes" is a misconception that originated when Windows introduced Unicode APIs using 16 bits, since at that time the Unicode specification was not so wide. In fact, reading 2 bytes does not necessarily mean "reading a character".
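
To make the 1-to-4-byte rule concrete, here is a minimal sketch that reads one code point from a UTF-8 file using fgetc(). It assumes the file was opened in binary mode and has no byte order mark; read_utf8_codepoint is a hypothetical helper, and it is simplified (overlong forms and encoded surrogates are not rejected).

```c
#include <stdio.h>

/* Hypothetical helper: reads one UTF-8 encoded code point from fp.
   Returns the code point, -1 at end of file, or -2 on a malformed sequence. */
long read_utf8_codepoint(FILE *fp)
{
    int c = fgetc(fp);
    if (c == EOF)
        return -1;

    long cp;
    int extra;                                  /* continuation bytes expected */
    if (c < 0x80)                { cp = c;        extra = 0; }  /* 0xxxxxxx: ASCII */
    else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }  /* 110xxxxx */
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }  /* 1110xxxx */
    else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }  /* 11110xxx */
    else
        return -2;                              /* stray continuation or invalid lead byte */

    while (extra-- > 0) {
        c = fgetc(fp);
        if (c == EOF || (c & 0xC0) != 0x80)     /* must look like 10xxxxxx */
            return -2;
        cp = (cp << 6) | (c & 0x3F);            /* append 6 payload bits */
    }
    return cp;
}
```

Calling it in a loop until it returns -1 walks the file one code point at a time, regardless of how many bytes each one occupies.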

                    2 bugs found. > recompile ... 65534 bugs found. :doh:

                      Emilio Garavaglia
                      wrote on last edited by
                      #12

                      This is actually a misconception ... see here.

                        Albert Holguin
                        wrote on last edited by
                        #13

                        Unfortunately, I think it depends on which C/C++ standard and which OS. I'm pretty sure Windows defines Unicode as 16 bits... from their website: "Unicode: A fixed-width, 16-bit worldwide character encoding that was developed and is maintained and promoted by the Unicode Consortium, a nonprofit computer industry organization." http://msdn.microsoft.com/en-us/library/cc194793.aspx

                          Albert Holguin
                          wrote on last edited by
                          #14

                          Here's another reference to Microsoft: http://msdn.microsoft.com/en-us/library/2dax2h36.aspx

                            Emilio Garavaglia
                            wrote on last edited by
                            #15

                            I'm sorry for you and for Microsoft, but the one and only body entitled to say what Unicode was, is and will be is www.unicode.org. The page you linked is a real shame for Microsoft. A technical document like that should not be published without stating on the page itself the date when it was written (hey ... they speak about their new amazing Windows NT 3.5 ...), and this fault alone should disqualify M$ from any authority in the field.

                              Albert Holguin
                              wrote on last edited by
                              #16

                              so angry! :laugh: ...similar articles found in the MS VS2010 area of MSDN... i don't do much in unicode so haven't needed to worry about it...

                                Emilio Garavaglia
                                wrote on last edited by
                                #17

                                Nope. 1) Unicode is not a Microsoft product. What Unicode is is defined by www.unicode.org. 2) Microsoft encodes Unicode into 16-bit units. That is a technique well defined by the Unicode standard itself, known as UTF-16. Essentially, every code point below 0x10000 and outside the range 0xD800-0xDFFF is encoded as itself. Every code point greater than 0xFFFF has 0x10000 subtracted from it, and the 20-bit result is split into two 10-bit chunks, which are OR-ed with 0xD800 and 0xDC00 respectively. The range 0xD800-0xDFFF is called the "surrogate" range and does not contain valid code points. So a single Unicode character with a code greater than 0xFFFF requires two wchar_t in sequence to be represented (as happens for some of the rarer CJK, i.e. Chinese, Japanese and Korean, characters).
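
The surrogate arithmetic can be sketched as follows for a file stored as little-endian UTF-16. The sketch assumes binary mode and that any byte order mark has already been skipped; both helper names are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: reads one 16-bit code unit stored little-endian.
   Returns the unit (0..0xFFFF) or -1 at end of file or on a truncated unit. */
static long read_utf16le_unit(FILE *fp)
{
    int lo = fgetc(fp);
    if (lo == EOF)
        return -1;
    int hi = fgetc(fp);
    if (hi == EOF)
        return -1;
    return ((long)hi << 8) | lo;
}

/* Hypothetical helper: reads one code point from a UTF-16LE stream.
   Returns the code point, -1 at end of file, or -2 on an unpaired surrogate. */
long read_utf16le_codepoint(FILE *fp)
{
    long u = read_utf16le_unit(fp);
    if (u < 0)
        return -1;
    if (u < 0xD800 || u > 0xDFFF)
        return u;                       /* outside the surrogate range: encoded as itself */
    if (u >= 0xDC00)
        return -2;                      /* low surrogate with no high surrogate before it */

    long low = read_utf16le_unit(fp);
    if (low < 0xDC00 || low > 0xDFFF)
        return -2;                      /* high surrogate not followed by a low one */

    /* Undo the encoding: two 10-bit chunks plus the 0x10000 offset. */
    return 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00);
}
```

For example, the unit pair 0xD83D 0xDE00 yields 0x10000 + (0x3D << 10) + 0x200 = 0x1F600.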

                                  Albert Holguin
                                  wrote on last edited by
                                  #18

                                  i certainly believe your point about unicode consortium being the authority... no argument there! :)

                                    malaugh
                                    wrote on last edited by
                                    #19

                                    If you have the program set to Unicode, you need to use _wfopen_s to open the file, and the filename (OpenFile) needs to be specified as wchar_t, something like wchar_t Myfile[] = L"my_file.ext"; Then you should be able to use fgetwc to get the characters using ch = fgetwc( stream ); you should declare ch as wint_t, the type fgetwc returns, so that WEOF can be detected before the value is used as a wchar_t.
