Character Encoding
-
Do I need to know the encoding of a file before opening it, or is it determined automatically by the framework? If the .NET Framework cannot determine it automatically, is there any way I can find it out myself? I'm not clear on this character encoding thing, so if you can please explain in detail, that would be of great help. Thx, Gurmeet
BTW, can Google help me search my lost pajamas?
My Articles: HTML Reader C++ Class Library, Numeric Edit Control
-
StreamReader has several constructors, some of which take an Encoding and/or a boolean value indicating whether the encoding should be detected or not. A bit of poking around in Reflector reveals that if you don't provide an encoding, it uses UTF-8, and unless you say otherwise, it tries to detect the encoding rather than simply assuming UTF-8. When detecting an encoding, it looks for the Byte Order Mark character. The Unicode standard indicates that this character, U+FEFF, should appear at the beginning of the text in whatever encoding is used. In UTF-16 little-endian, this becomes the byte sequence 0xFF 0xFE; in UTF-8, it's (IIRC) 0xEF 0xBB 0xBF. .NET can also detect UTF-16BE (big-endian), where the bytes of UTF-16 are the other way round. If there's no Byte Order Mark, it simply uses the encoding specified in the constructor, or UTF-8 if you used a constructor without one. If you use File.OpenText or FileInfo.OpenText, you don't get to specify an encoding.
Unfortunately, very few of us have files encoded as UTF-8. They're far more likely to be encoded using our default code page. For most Western European and North American users, this is going to be Windows 1252 (Windows Western). You can get hold of an encoding for the user's configured ANSI code page using Encoding.Default. Western users, particularly in the UK, US and Canada, may not notice at first that the encoding is different, because the first 256 code points of Unicode are the same as ISO Latin 1 (a little, though not a lot, different from 1252). Due to the way it's encoded, the first 128 code points of UTF-8 are also the same as Latin 1 and ASCII (ISO-646-US). Any UTF-8 byte greater than 127 indicates that one or more following bytes need to be interpreted along with it to get the full character.
There's no reliable way to detect which encoding is used by a random sample of text in a byte-oriented character stream (one which isn't UTF-8). The concept of Byte Order Marks is relatively new. You either have to know, or ask the user.
More information links: Microsoft Global Development Portal[^] Code Page reference tables[^]
-
-
Great information! I just wanted to add that BOMs (byte order marks) aren't always present in a text file; there is no requirement for a BOM. While there's no reliable way to detect the encoding, like you said, web browsers and other applications (like Word) do try to detect it. If you, the original poster, need to do something like that, a simple (but probably not the most efficient) way is to take a random sampling of strings within the text file and use
StringInfo.GetTextElementEnumerator and enumerate the text elements. With either all of those or a random sample, call TextElementEnumerator.GetTextElement (which returns a String) and check its Length. If it's greater than one, you at least know you're dealing with a multi-byte character set (MBCS), like UTF-8. If all of them were 2 bytes, it's likely a double-byte character set (DBCS), like UTF-16 (there are also 4-byte characters, as in UTF-32!). If they're all 1 byte, then you've probably got an ASCII (or other single-byte encoding) file. From there you can make certain assumptions. You see browsers doing this when they start displaying question marks for characters in odd places (this also happens when the specified encoding is wrong).
Microsoft MVP, Visual C# My Articles