How can I read a Unicode text file character by character?

Albert Holguin

I hope you know that Unicode is two bytes and ASCII is one byte...

Emilio Garavaglia

Unicode is actually a set of code points whose cardinality requires 21 bits. When it is encoded as a sequence of 1-byte units it is called UTF-8, and when it is encoded as a sequence of 2-byte units it is called UTF-16. In UTF-8 a code point takes from 1 to 4 bytes (and the encoding is identical to ASCII for code points between 0 and 127). In UTF-16 a code point takes 2 or 4 bytes (two bytes for most Latin, Cyrillic and Greek characters, as well as many Simplified Chinese ones). "Unicode == 2 bytes" is a misconception that originated when Windows introduced its Unicode APIs using 16 bits since, at that time, the Unicode specs were not so wide. Actually, reading 2 bytes does not necessarily mean "reading a character".
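
To make the "1 to 4 bytes" point concrete, here is a minimal sketch of reading a UTF-8 stream one code point at a time. The function name `read_utf8_code_point` is made up for this illustration, and the sketch assumes the file really is UTF-8 without a byte order mark. It does not reject overlong encodings or encoded surrogates, so treat it as a starting point rather than a validating decoder.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads one UTF-8 encoded code point from `in`.
// Returns false at end of input or when the byte sequence is malformed.
bool read_utf8_code_point(std::istream& in, char32_t& out) {
    const int eof = std::char_traits<char>::eof();
    int first = in.get();
    if (first == eof) return false;

    unsigned char lead = static_cast<unsigned char>(first);
    int extra = 0;       // number of continuation bytes that must follow
    char32_t cp = 0;
    if (lead < 0x80)                { extra = 0; cp = lead; }         // ASCII
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }  // 2 bytes
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }  // 3 bytes
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }  // 4 bytes
    else return false;  // stray continuation byte or invalid lead byte

    for (int i = 0; i < extra; ++i) {
        int next = in.get();
        // Every continuation byte has the bit pattern 10xxxxxx.
        if (next == eof || (next & 0xC0) != 0x80) return false;
        cp = (cp << 6) | static_cast<char32_t>(next & 0x3F);
    }
    out = cp;
    return true;
}
```

With a `std::ifstream` opened in binary mode, calling this in a loop until it returns false yields one `char32_t` per code point, regardless of how many bytes each one took in the file.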

2 bugs found. > recompile ... 65534 bugs found. :doh:

Emilio Garavaglia

This is actually a misconception ... see here.

Albert Holguin

Unfortunately, I think it depends on which C/C++ standard and which OS you are using. I'm pretty sure Windows defines Unicode as 16 bits... from their website: "Unicode: A fixed-width, 16-bit worldwide character encoding that was developed and is maintained and promoted by the Unicode Consortium, a nonprofit computer industry organization." http://msdn.microsoft.com/en-us/library/cc194793.aspx

Albert Holguin

Here's another reference from Microsoft: http://msdn.microsoft.com/en-us/library/2dax2h36.aspx

Emilio Garavaglia

I'm sorry for you and for Microsoft, but the one and only party entitled to say what Unicode was, is and will be is www.unicode.org. The page you linked is a real shame for Microsoft. A technical document like that should not be published without stating, on the page itself, the date when it was written (hey ... they talk about their new amazing Windows NT 3.5 ...), and that fault alone should disqualify M$ from any authority in the field.

Albert Holguin

So angry! :laugh: ...similar articles can be found in the MS VS2010 area of MSDN... I don't do much with Unicode, so I haven't needed to worry about it...

Emilio Garavaglia

Nope.
1) Unicode is not a Microsoft product. What Unicode is, is defined by www.unicode.org.
2) Microsoft encodes Unicode into 16-bit units. That is a technique well defined by the Unicode standard itself, known as UTF-16. Essentially, every code point up to 0xFFFF that is not in the range 0xD800-0xDFFF is encoded as itself. Every code point greater than 0xFFFF first has 0x10000 subtracted from it; the resulting 20-bit value is then split into two 10-bit chunks, which are OR-ed with 0xD800 and 0xDC00 respectively. The range 0xD800-0xDFFF is called the "surrogate" range and does not contain valid code points. So a single Unicode character with a code greater than 0xFFFF needs two wchar_t in sequence to be represented (this happens, for example, with some of the less common CJK characters: Chinese, Japanese, Korean).
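
The surrogate arithmetic described above can be sketched as follows. The names `encode_utf16` and `decode_surrogate_pair` are invented for this example, and the code uses `char16_t` rather than `wchar_t` so that the unit size is 16 bits on every platform (an assumption made for portability; on Windows `wchar_t` plays the same role).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one code point as one or two UTF-16 units.
std::u16string encode_utf16(char32_t cp) {
    // Valid code points stop at 0x10FFFF and exclude the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string units;
    if (cp < 0x10000) {
        units.push_back(static_cast<char16_t>(cp));  // encoded as itself
    } else {
        char32_t v = cp - 0x10000;                   // 20-bit value
        units.push_back(static_cast<char16_t>(0xD800 | (v >> 10)));    // high
        units.push_back(static_cast<char16_t>(0xDC00 | (v & 0x3FF)));  // low
    }
    return units;
}

// Hypothetical helper: rebuilds the code point from a high/low surrogate pair.
char32_t decode_surrogate_pair(char16_t high, char16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

For instance, U+1F600 becomes the pair 0xD83D, 0xDE00, while U+20AC stays a single unit. A reader that walks a UTF-16 buffer unit by unit has to check for a high surrogate and consume the following unit as well before it has a complete character.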

Albert Holguin

I certainly accept your point about the Unicode Consortium being the authority... no argument there! :)

malaugh

If you have the program set to Unicode, you need to use _wfopen_s to open the file, and the filename (OpenFile) needs to be specified as wchar_t, something like wchar_t Myfile[] = L"my_file.ext"; (note the L prefix, which makes it a wide string literal). Then you should be able to use fgetwc to get the characters using ch = fgetwc( stream ); you should declare ch as wint_t rather than wchar_t, so that it can be compared against WEOF to detect the end of the file.
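
A minimal sketch of that approach, under some assumptions: `open_utf8_text` and `read_wide_chars` are names invented here, the Windows-only part assumes the file is UTF-8 and uses the `ccs=UTF-8` mode flag documented for the Microsoft CRT, and the reading loop itself is plain standard C library code.

```cpp
#include <cassert>
#include <cstdio>
#include <cwchar>
#include <string>

#ifdef _WIN32
// Windows-only (Microsoft CRT): open a text file with a wide-character path
// and ask the CRT to translate UTF-8 content into wide characters.
std::FILE* open_utf8_text(const wchar_t* path) {
    std::FILE* f = nullptr;
    if (_wfopen_s(&f, path, L"rt, ccs=UTF-8") != 0) return nullptr;
    return f;
}
#endif

// Reads wide characters from an already opened stream until fgetwc reports
// WEOF, which signals end of file or a read/conversion error.
std::wstring read_wide_chars(std::FILE* stream) {
    assert(stream != nullptr);
    std::wstring text;
    for (;;) {
        std::wint_t ch = std::fgetwc(stream);  // wint_t can represent WEOF
        if (ch == WEOF) break;
        text.push_back(static_cast<wchar_t>(ch));
    }
    return text;
}
```

On Windows, usage would look like `std::FILE* f = open_utf8_text(L"my_file.ext");` followed by `read_wide_chars(f)` and `std::fclose(f)`. Since `wchar_t` is 16 bits there, each value returned by `fgetwc` is a UTF-16 unit, so a character above 0xFFFF still arrives as two consecutive values.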