How can I read a Unicode text file character by character?

    Le rner wrote (#1):

    Hi all, I have a Unicode file and I want to read it character by character, that is, read only one character at a time. Please tell me how I can do this. Thanks in advance.

      pix_programmer wrote (#2, in reply to #1):

      Use getch().

        Le rner wrote (#3, in reply to #2):

        How?

          CPallini wrote (#4, in reply to #1):

          What kind of Unicode text file are you dealing with (e.g. UTF-8, UTF-16)? What do you mean by 'character' (e.g. 'Unicode character' or 'single byte')? :)

          If the Lord God Almighty had consulted me before embarking upon the Creation, I would have recommended something simpler. -- Alfonso the Wise, 13th Century King of Castile.
          This is going on my arrogant assumptions. You may have a superb reason why I'm completely wrong. -- Iain Clarke
          [My articles]

          In testa che avete, signor di Ceprano?

            Lost User wrote (#5, in reply to #1):

            Use fgetc()/fgetwc() or wcin. It's exactly the same process as reading non-Unicode.
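
As a rough illustration of the fgetwc() route, here is a minimal sketch. The helper name read_wide_chars is hypothetical, and the code assumes the stream has not already been used for byte-oriented I/O. How the bytes in the file are mapped to wide characters depends on the platform, the locale, and the mode the file was opened with, so treat this as a starting point rather than a complete solution.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: reads up to cap wide characters from fp, one per
   fgetwc() call, and returns how many were stored in out. */
size_t read_wide_chars(FILE *fp, wchar_t *out, size_t cap)
{
    size_t n = 0;
    wint_t wc;  /* wint_t rather than wchar_t, so that WEOF can be detected */

    assert(fp != NULL && out != NULL);
    while (n < cap && (wc = fgetwc(fp)) != WEOF) {
        out[n++] = (wchar_t)wc;
    }
    return n;
}
```

The loop tests the return value of fgetwc() against WEOF directly instead of calling feof() first, which avoids processing one bogus character at the end of the file.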

            I must get a clever new signature for 2011.

              Le rner wrote (#6, in reply to #4):

              CPallini wrote:

              What kind of Unicode text file are you dealing with (e.g. UTF-8, UTF-16)?
              What do you mean by 'character' (e.g. 'Unicode character' or 'single byte')?

              The file uses the "Unicode" encoding type, and by one character I mean a single byte.

                CPallini wrote (#7, in reply to #6):

                If you want to read one byte at a time then use, as already suggested, fgetc. :)
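
A byte-at-a-time loop with fgetc could look like the sketch below. The helper name read_bytes is made up for illustration, and the stream is assumed to have been opened in binary mode ("rb") so that no newline translation happens on Windows.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copies up to cap bytes from fp into out, reading one
   byte per fgetc() call, and returns the number of bytes read. */
size_t read_bytes(FILE *fp, unsigned char *out, size_t cap)
{
    size_t n = 0;
    int ch;  /* int, not char: fgetc() returns EOF (a negative int) at the end */

    assert(fp != NULL && out != NULL);
    while (n < cap && (ch = fgetc(fp)) != EOF) {
        out[n++] = (unsigned char)ch;
    }
    return n;
}
```

Note that in a UTF-8 or UTF-16 file a single byte is not necessarily a whole character; for example, the two bytes 0xC3 0xA9 together form one UTF-8 character.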

                  Le rner wrote (#8, in reply to #5):

                  I'm using this now, but it's not working.

                  FILE *stream;
                  char buffer[2];
                  int kk, ch;

                  // Open the file to read from:
                  fopen_s( &stream, OpenFile, "r" );
                  if( stream == NULL )
                  {
                      return;
                      //exit( 0 );
                  }

                  // Read one character and place it in "buffer":
                  ch = fgetc( stream );
                  for( kk = 0; ( kk < 1 ) && ( feof( stream ) == 0 ); kk++ )
                  {
                      buffer[kk] = (char)ch;
                      ch = fgetc( stream );
                  }

                  // Add null to end string
                  buffer[kk] = '\0';
                  printf( "%s\n", buffer );
                  fclose( stream );

                  Here it is unable to open the file; stream is always NULL, which is why it returns.

                    Lost User wrote (#9, in reply to #8):

                    You are ignoring the return code from fopen_s so it's impossible to diagnose your problem. Use something like the following and look up the error code that you receive.

                    errno_t errNum = fopen_s( &stream, OpenFile, "r" );
                    if (errNum != 0)
                    {
                        // add some code here to display the error value
                        // or set a breakpoint and inspect its contents
                    }

                    I would suggest you look at the fopen_s documentation for further guidance. Incidentally, the rest of your code does not seem to be set up to process Unicode data, which was the subject of your original question.
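
To make the "display the error value" step concrete, here is one possible sketch. It uses the standard fopen() and errno instead of fopen_s() so that it also compiles outside Visual C++; with fopen_s() the returned errno_t carries the same kind of code. The helper name open_or_report is hypothetical, and the code assumes the runtime sets errno when fopen() fails (POSIX systems and the Microsoft CRT do, although ISO C does not require it).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: tries to open path and, on failure, prints the reason.
   Returns the stream, or NULL with *err_out set to the errno value. */
FILE *open_or_report(const char *path, const char *mode, int *err_out)
{
    FILE *fp;

    assert(path != NULL && mode != NULL && err_out != NULL);
    errno = 0;                       /* so a stale value is never reported */
    fp = fopen(path, mode);
    *err_out = (fp == NULL) ? errno : 0;
    if (fp == NULL) {
        /* Assumption: errno was set by the failed fopen() call. */
        fprintf(stderr, "cannot open '%s': %s (error %d)\n",
                path, strerror(*err_out), *err_out);
    }
    return fp;
}
```

A message such as "No such file or directory" or "Permission denied" usually narrows the problem down much faster than a bare NULL check.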

                      Albert Holguin wrote (#10, in reply to #6):

                      I hope you know that Unicode is two bytes and ASCII is one byte...

                        Emilio Garavaglia wrote (#11, in reply to #10):

                        Unicode is actually a set of code points whose range requires 21 bits. When encoded as a sequence of 1-byte units it is called UTF-8, and when encoded as a sequence of 2-byte units it is called UTF-16. In UTF-8 a character takes from 1 to 4 bytes (and the encoding is identical to ASCII for code points between 0 and 127). In UTF-16 a character takes 2 or 4 bytes (2 for most Latin, Cyrillic and Greek characters, as well as many simplified Chinese ones). "Unicode == 2 bytes" is a misconception that dates from the time Windows introduced Unicode APIs using 16 bits, since at that time the Unicode specification was not so wide. So reading 2 bytes does not necessarily mean "reading a character".

                        2 bugs found. > recompile ... 65534 bugs found. :doh:

                          Emilio Garavaglia wrote (#12, in reply to #6):

                          This is actually a misconception ... see here.

                            Albert Holguin wrote (#13, in reply to #11):

                            Unfortunately, I think it depends on which C/C++ standard and which OS you use. I'm pretty sure Windows defines Unicode as 16 bits... From their website: "Unicode: A fixed-width, 16-bit worldwide character encoding that was developed and is maintained and promoted by the Unicode Consortium, a nonprofit computer industry organization." http://msdn.microsoft.com/en-us/library/cc194793.aspx

                              Albert Holguin wrote (#14, in reply to #12):

                              Here's another reference to Microsoft: http://msdn.microsoft.com/en-us/library/2dax2h36.aspx

                                Emilio Garavaglia wrote (#15, in reply to #13):

                                I'm sorry for you and for Microsoft, but the one and only body entitled to say what Unicode was, is and will be is www.unicode.org. The page you linked is a real embarrassment for Microsoft. A technical document like that should not be published without stating, on the page itself, the date when it was written (hey ... they talk about their new amazing Windows NT 3.5 ...), and this fault alone should disqualify M$ from any authority in the field.

                                  Albert Holguin wrote (#16, in reply to #15):

                                  So angry! :laugh: ...similar articles can be found in the MS VS2010 area of MSDN... I don't do much with Unicode, so I haven't needed to worry about it...

                                    Emilio Garavaglia wrote (#17, in reply to #14):

                                    Nope. 1) Unicode is not a Microsoft product. What Unicode is, is defined by www.unicode.org. 2) Microsoft usually encodes Unicode in 16-bit units. That is a technique well defined by the Unicode standard itself, known as UTF-16. Essentially, every code point up to 0xFFFF and not in the range 0xD800-0xDFFF is encoded as itself. Every code point greater than 0xFFFF has 0x10000 subtracted from it, and the resulting 20-bit value is split into two 10-bit chunks, which are or-ed with 0xD800 and 0xDC00 respectively. The range 0xD800-0xDFFF is called the "surrogate" range and does not contain valid characters. So you can have a single Unicode character, with a code greater than 0xFFFF, that requires a sequence of two wchar_t to be represented (for example, some CJK - Chinese, Japanese, Korean - characters).

                                      Albert Holguin wrote (#18, in reply to #17):

                                      I certainly accept your point about the Unicode Consortium being the authority... no argument there! :)

                                        malaugh wrote (#19, in reply to #8):

                                        If you have the program set to Unicode, you need to use _wfopen_s to open the file, and the filename (OpenFile) needs to be specified as wchar_t, something like wchar_t Myfile[] = L"my_file.ext"; (note the L prefix on the string literal). Then you should be able to use fgetwc to get the characters, using ch = fgetwc( stream ); you should declare ch as wint_t, the return type of fgetwc, so that it can also hold WEOF.
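
Whether fgetwc() decodes a given file correctly depends on the runtime and on how the file was opened. As an alternative that does not rely on the wide-character API at all, the sketch below decodes the file by hand. It assumes the file is UTF-16LE (the encoding that Windows tools often label "Unicode") and that the stream was opened in binary mode; the function names are hypothetical and the error handling is deliberately simplistic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Reads one little-endian 16-bit unit; returns 1 on success, 0 at end of file. */
static int read_unit_le(FILE *fp, uint16_t *unit)
{
    unsigned char b[2];

    if (fread(b, 1, 2, fp) != 2) {
        return 0;
    }
    *unit = (uint16_t)(b[0] | (b[1] << 8));
    return 1;
}

/* Hypothetical helper: returns the next code point from a UTF-16LE stream,
   or -1 at end of file. A byte order mark at the start of the file comes back
   as U+FEFF, so the caller may want to skip the first value. Unpaired low
   surrogates are passed through unchanged. */
long next_code_point_utf16le(FILE *fp)
{
    uint16_t hi, lo;

    assert(fp != NULL);
    if (!read_unit_le(fp, &hi)) {
        return -1;
    }
    if (hi >= 0xD800 && hi <= 0xDBFF) {       /* high surrogate: needs a partner */
        if (!read_unit_le(fp, &lo) || lo < 0xDC00 || lo > 0xDFFF) {
            return 0xFFFD;                    /* malformed: replacement character */
        }
        return 0x10000L + (((long)(hi - 0xD800) << 10) | (long)(lo - 0xDC00));
    }
    return hi;
}
```

Calling next_code_point_utf16le() in a loop until it returns -1 yields one whole character per call, even for characters stored as surrogate pairs.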
