How to get a string's encoding
-
Given a particular string, is there a way to tell whether it contains Unicode characters? I know I could test every character's range, but I'm wondering if there is some API call for this. I'm looking in string, System.Text.Encoding, and Globalization, but haven't found any likely suspects yet... Matt Gerrans
-
Strings in .NET are stored as Unicode. The encoding only matters when reading from and writing to streams (text files, network streams, etc.). Software Design Engineer, Developer Division Sustained Engineering, Microsoft
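To illustrate that point (a minimal sketch added here, not from the original posts): a string only takes on a byte-level encoding when you convert it or write it out.
using System;
using System.Text;
class EncodingDemo
{
    static void Main()
    {
        string s = "héllo";                              // held internally as UTF-16 code units
        byte[] utf8  = Encoding.UTF8.GetBytes(s);        // 6 bytes: 'é' becomes two bytes
        byte[] utf16 = Encoding.Unicode.GetBytes(s);     // 10 bytes: two bytes per char
        byte[] ascii = Encoding.ASCII.GetBytes(s);       // 5 bytes: 'é' is replaced with '?'
        Console.WriteLine("{0} {1} {2}", utf8.Length, utf16.Length, ascii.Length);
    }
}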
-
I know that. I'm adding lines to existing files. If the existing file is ASCII and I'm adding regular ASCII text, then everything's fine. However, if the line I'm adding has some Unicode characters (which may be the case), I want to change the file's encoding to Unicode and rewrite the whole thing. (If the existing file is already Unicode, it is easy of course, except for this StreamReader bug that tells you it is UTF8.) I've figured out how to do this already, by going through all the characters and checking for any outside the 0-255 range, but I was wondering if there was an API call, or a more idiomatic way of handling this. Matt Gerrans
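For reference, a minimal sketch of the per-character scan described above (the method name is illustrative, not from the thread):
// Decide whether a line needs a Unicode encoding when written back out.
static bool NeedsUnicode(string s)
{
    foreach (char c in s)
    {
        if (c > 0xFF)    // outside the 0-255 range; use 0x7F instead for strict 7-bit ASCII
            return true;
    }
    return false;
}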
-
And why do you call it a bug? It sounds correct. UTF8 is an MBCS (multi-byte character set) that uses 7-bit characters as ANSI does, but 8-bit characters (i.e., the 8th bit is set) denote Unicode codepoints. That's the beauty of UTF8: it maintains backward compatibility so long as you don't use Unicode, and if you must, it allows for that. So, use the UTF8Encoding instead. Software Design Engineer, Developer Division Sustained Engineering, Microsoft
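To make the backward-compatibility claim concrete (a small sketch added here, not from the original posts): pure ASCII text produces identical bytes under ASCII and UTF-8, while non-ASCII characters become multi-byte sequences.
using System;
using System.Text;
class Utf8Compat
{
    static void Main()
    {
        string plain = "hello";
        // For 7-bit ASCII text, the ASCII and UTF-8 byte streams are identical.
        Console.WriteLine(
            BitConverter.ToString(Encoding.ASCII.GetBytes(plain)) ==
            BitConverter.ToString(Encoding.UTF8.GetBytes(plain)));              // True

        string accented = "café";
        // 'é' (U+00E9) becomes the two-byte UTF-8 sequence C3 A9.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(accented)));
        // 63-61-66-C3-A9
    }
}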
-
It is incorrect if the file's BOM ("\xff\xfe") says it is Unicode (or UTF16) and the reader thinks it is UTF8 ("\xef\xbb\xbf"). So I call it a bug, because I think it is one. I've since noticed that if I use the StreamReader's string constructor, it correctly identifies it as Unicode, but if I use the FileStream constructor it mis-identifies it as UTF8. So if I do this with a Unicode-encoded file:
void SomeMethod( FileInfo info )
{
StreamReader reader = new StreamReader( info.OpenRead() );
System.Text.Encoding encoding = reader.CurrentEncoding; // UTF8!?
string data = reader.ReadToEnd();
reader.Close();
data = Massage(data);
StreamWriter writer = new StreamWriter( info.OpenWrite(), encoding );
writer.Write(data);
writer.Close();
}
I get a UTF8 encoded file as a result, which I don't want. On the other hand, if I do this:
void SomeMethod( FileInfo info )
{
// Use the filename instead of OpenRead():
StreamReader reader = new StreamReader( info.FullName );
System.Text.Encoding encoding = reader.CurrentEncoding; // Unicode!
string data = reader.ReadToEnd();
reader.Close();
data = Massage(data);
StreamWriter writer = new StreamWriter( info.OpenWrite(), encoding );
writer.Write(data);
writer.Close();
}
The file will be Unicode, as expected (and desired). Maybe the intermediate use of the FileStream causes the loss of the encoding? Because these files are used by multiple platforms and programming languages (not all of which support MBCS), I want to simply use either ASCII or Unicode, but not UTF8 (or UTF7 or Unicode Big-Endian, etc.). I think there is not that much beauty in UTF8 (the "backward compatibility" also gets hosed by the use of extended ASCII characters, which usually comes from "backward" text files that were using drawing characters and the like); it's just unnecessary complexity, especially in these days of multi-gigabyte storage. By the way, the original question was about detecting the presence of Unicode characters in a string (which, having 16-bit characters, could contain some, or not); this would affect the case where the original file was ASCII, but a line with some Unicode characters was inserted into it. In that case, I just want to switch the whole file over to Unicode.
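An aside added here, not from the original posts: StreamReader's documented behavior is that BOM autodetection happens on the first read, so CurrentEncoding may not reflect the file's actual encoding until after ReadToEnd. A hedged sketch along the lines of the code above, with Massage standing in for the poster's processing and RewriteFile as an illustrative name (assumes using System.IO and System.Text):
void RewriteFile( FileInfo info )
{
    Encoding encoding;
    string data;
    using ( StreamReader reader = new StreamReader( info.OpenRead(), true ) )   // true = detect BOM
    {
        data = reader.ReadToEnd();
        encoding = reader.CurrentEncoding;   // now reflects the BOM, e.g. Unicode for FF FE
    }

    data = Massage(data);

    // FileMode.Create truncates the file, so a shorter result doesn't leave stale bytes behind.
    using ( StreamWriter writer = new StreamWriter( info.Open(FileMode.Create), encoding ) )
    {
        writer.Write(data);
    }
}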
-
You should use StreamReader reader = info.OpenText(). If you look at the IL, the code for opening a StreamReader from a stream vs. a filename is the same. The constructor which takes a string actually does the same thing you are doing: it opens a FileStream and passes it to Init (an internal method which every constructor eventually calls). As for your original question, see the StringInfo class defined in the System.Globalization namespace. It allows you to enumerate the text elements of a string (characters derived from however many code points), which you could then use to determine what kinds of characters a string contains. Software Design Engineer, Developer Division Sustained Engineering, Microsoft
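An illustrative sketch of that suggestion (the decision rule and names are assumptions, not from the thread): StringInfo walks a string one text element at a time, and the caller can then decide whether plain ASCII output is still safe.
using System;
using System.Globalization;
class TextElementDemo
{
    static void Main()
    {
        string s = "a\u00E9b";   // 'a', 'é', 'b'

        // Enumerate text elements (a text element may span several UTF-16 code units).
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        bool nonAscii = false;
        while (e.MoveNext())
        {
            string element = (string)e.Current;
            if (element[0] > 0x7F)       // first code unit outside 7-bit ASCII
                nonAscii = true;
        }

        Console.WriteLine(nonAscii ? "write as Unicode" : "ASCII is enough");
    }
}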