How to get a string's encoding
-
Given a particular string, is there a way to tell whether it contains Unicode characters? I know I could test every character's range, but I'm wondering if there is some API call for this. I'm looking in string, System.Text.Encoding, and Globalization, but haven't found any likely suspects yet... Matt Gerrans
-
Strings in .NET are stored as Unicode. The encoding only matters when reading from and writing to streams (text files, network streams, etc.). Software Design Engineer, Developer Division Sustained Engineering, Microsoft
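To illustrate that point (a minimal sketch added here, not from the original posts): a string only takes on a byte-level encoding when you convert it or write it out.
using System;
using System.Text;
class EncodingDemo
{
    static void Main()
    {
        string s = "héllo";                              // held internally as UTF-16 code units
        byte[] utf8  = Encoding.UTF8.GetBytes(s);        // 6 bytes: 'é' becomes two bytes
        byte[] utf16 = Encoding.Unicode.GetBytes(s);     // 10 bytes: two bytes per char
        byte[] ascii = Encoding.ASCII.GetBytes(s);       // 5 bytes: 'é' is replaced with '?'
        Console.WriteLine("{0} {1} {2}", utf8.Length, utf16.Length, ascii.Length);
    }
}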
-
I know that. I'm adding lines to existing files. If the existing file is ASCII and I'm adding regular ASCII text, then everything's fine. However, if the line I'm adding has some Unicode characters (which may be the case), I want to change the file's encoding to Unicode and rewrite the whole thing. (If the existing file is already Unicode, it is easy of course, except for this StreamReader bug that tells you it is UTF8.) I've figured out how to do this already, by going through all the characters and checking for any outside the 0-255 range, but I was wondering if there was an API call, or a more idiomatic way of handling this. Matt Gerrans
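For reference, a minimal sketch of the per-character scan described above (the method name is illustrative, not from the thread):
// Decide whether a line needs a Unicode encoding when written back out.
static bool NeedsUnicode(string s)
{
    foreach (char c in s)
    {
        if (c > 0xFF)    // outside the 0-255 range; use 0x7F instead for strict 7-bit ASCII
            return true;
    }
    return false;
}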
-
And why do you call it a bug? It sounds correct. UTF8 is an MBCS (multi-byte character set) that uses 7-bit characters as ANSI does, but 8-bit characters (i.e., the 8th bit is set) denote Unicode codepoints. That's the beauty of UTF8: it maintains backward compatibility so long as you don't use Unicode, and if you must, it allows for that. So, use the UTF8Encoding instead. Software Design Engineer, Developer Division Sustained Engineering, Microsoft
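To make the backward-compatibility claim concrete (a small sketch added here, not from the original posts): pure ASCII text produces identical bytes under ASCII and UTF-8, while non-ASCII characters become multi-byte sequences.
using System;
using System.Text;
class Utf8Compat
{
    static void Main()
    {
        string plain = "hello";
        // For 7-bit ASCII text, the ASCII and UTF-8 byte streams are identical.
        Console.WriteLine(
            BitConverter.ToString(Encoding.ASCII.GetBytes(plain)) ==
            BitConverter.ToString(Encoding.UTF8.GetBytes(plain)));              // True

        string accented = "café";
        // 'é' (U+00E9) becomes the two-byte UTF-8 sequence C3 A9.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(accented)));
        // 63-61-66-C3-A9
    }
}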
-
It is incorrect if the file's BOM ("\xff\xfe") says it is Unicode (or UTF16) and the reader thinks it is UTF8 ("\xef\xbb\xbf"). So I call it a bug, because I think it is one. I've since noticed that if I use the StreamReader's string constructor, it correctly identifies it as Unicode, but if I use the FileStream constructor it mis-identifies it as UTF8. So if I do this with a Unicode-encoded file:
void SomeMethod( FileInfo info )
{
StreamReader reader = new StreamReader( info.OpenRead() );
System.Text.Encoding encoding = reader.CurrentEncoding; // UTF8!?
string data = reader.ReadToEnd();
reader.Close();
data = Massage(data);
StreamWriter writer = new StreamWriter( info.OpenWrite(), encoding );
writer.Write(data);
writer.Close();
}
I get a UTF8 encoded file as a result, which I don't want. On the other hand, if I do this:
void SomeMethod( FileInfo info )
{
// Use the filename instead of OpenRead():
StreamReader reader = new StreamReader( info.FullName );
System.Text.Encoding encoding = reader.CurrentEncoding; // Unicode!
string data = reader.ReadToEnd();
reader.Close();
data = Massage(data);
StreamWriter writer = new StreamWriter( info.OpenWrite(), encoding );
writer.Write(data);
writer.Close();
}
The file will be Unicode, as expected (and desired). Maybe the intermediate use of the FileStream causes the loss of the encoding? Because these files are used by multiple platforms and programming languages (not all of which support MBCS), I want to simply use either ASCII or Unicode, but not UTF8 (or UTF7 or Unicode Big-Endian, etc.). I think there is not that much beauty in UTF8 (the "backward compatibility" also gets hosed by the use of extended ASCII characters, which usually comes from "backward" text files that were using drawing characters and the like); it's just unnecessary complexity, especially in these days of multi-gigabyte storage. By the way, the original question was about detecting the presence of Unicode characters in a string (which, having 16-bit characters, could contain some, or not); this would affect the case where the original file was ASCII, but a line with some Unicode characters was inserted into it. In that case, I just want to switch the whole file over to Unicode.
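An aside added here, not from the original posts: StreamReader's documented behavior is that BOM autodetection happens on the first read, so CurrentEncoding may not reflect the file's actual encoding until after ReadToEnd. A hedged sketch along the lines of the code above, with Massage standing in for the poster's processing and RewriteFile as an illustrative name (assumes using System.IO and System.Text):
void RewriteFile( FileInfo info )
{
    Encoding encoding;
    string data;
    using ( StreamReader reader = new StreamReader( info.OpenRead(), true ) )   // true = detect BOM
    {
        data = reader.ReadToEnd();
        encoding = reader.CurrentEncoding;   // now reflects the BOM, e.g. Unicode for FF FE
    }

    data = Massage(data);

    // FileMode.Create truncates the file, so a shorter result doesn't leave stale bytes behind.
    using ( StreamWriter writer = new StreamWriter( info.Open(FileMode.Create), encoding ) )
    {
        writer.Write(data);
    }
}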
-
You should use StreamReader reader = info.OpenText(). If you look at the IL, the code for opening a StreamReader from a stream vs. a filename is the same. The constructor which takes a string actually does the same thing you are doing: it opens a FileStream and passes it to Init (an internal method which every constructor eventually calls). As for your original question, see the StringInfo class defined in the System.Globalization namespace. It allows you to enumerate the text elements of a string (characters derived from however many code points), which you could then use to determine what kinds of characters a string contains. Software Design Engineer, Developer Division Sustained Engineering, Microsoft
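An illustrative sketch of that suggestion (the decision rule and names are assumptions, not from the thread): StringInfo walks a string one text element at a time, and the caller can then decide whether plain ASCII output is still safe.
using System;
using System.Globalization;
class TextElementDemo
{
    static void Main()
    {
        string s = "a\u00E9b";   // 'a', 'é', 'b'

        // Enumerate text elements (a text element may span several UTF-16 code units).
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        bool nonAscii = false;
        while (e.MoveNext())
        {
            string element = (string)e.Current;
            if (element[0] > 0x7F)       // first code unit outside 7-bit ASCII
                nonAscii = true;
        }

        Console.WriteLine(nonAscii ? "write as Unicode" : "ASCII is enough");
    }
}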