UTF8Encoding question
-
Hello everybody, basically the problem is that UTF8Encoding is not able to decode the character '²'. Is this a known issue, or am I doing something wrong? This is what I do in a test routine:

    using (FileStream fs = File.OpenRead(filename))
    {
        byte[] b = new byte[10000];
        UTF8Encoding temp = new UTF8Encoding(false, true);
        int l = 0;
        FileStream fw = File.Open("copy.txt", System.IO.FileMode.Create);
        while ((l = fs.Read(b, 0, b.Length)) > 0)
        {
            string s = temp.GetString(b, 0, l);
            Console.WriteLine(s);
            //...
        }
        fw.Close();
    }

Thanks for your time and help. All the best, Martin
-
Can your console output the character even if it is read correctly? If your console code page does not support it, your program won't display it. UTF-8 is multibyte, so your read operation might split the byte stream in the middle of a character. Create a StreamReader with Encoding.UTF8 to avoid having to deal with this manually. Edit: Are you sure your file is actually UTF-8 encoded? :)
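To illustrate the point about split characters, here is a minimal sketch. The class and method names (`Utf8ChunkedRead`, `DecodeInChunks`, `ReadWithStreamReader`) are made up for this example. It assumes the input really is UTF-8; a stateful `Decoder` keeps the leading byte of a character that was cut off at the end of one chunk and completes it with the next chunk, which a plain `GetString` call per chunk cannot do.

```csharp
using System.IO;
using System.Text;

public static class Utf8ChunkedRead
{
    // Decodes a UTF-8 stream chunk by chunk. The Decoder remembers an
    // incomplete byte sequence at the end of a chunk and finishes it
    // when the next chunk arrives.
    public static string DecodeInChunks(Stream input, int chunkSize)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        byte[] bytes = new byte[chunkSize];
        // Large enough for the worst case of one chunk plus leftover state.
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(chunkSize)];
        StringBuilder result = new StringBuilder();

        int read;
        while ((read = input.Read(bytes, 0, bytes.Length)) > 0)
        {
            int charCount = decoder.GetChars(bytes, 0, read, chars, 0);
            result.Append(chars, 0, charCount);
        }
        // Note: an incomplete sequence at the very end of the stream
        // is silently dropped in this sketch.
        return result.ToString();
    }

    // The simpler route: StreamReader does the same bookkeeping internally.
    public static string ReadWithStreamReader(Stream input)
    {
        using (StreamReader reader = new StreamReader(input, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling `DecodeInChunks` on the bytes 6D C2 B2 with a chunk size of 2 splits the '²' across two reads and should still give back "m²". Neither method helps if the file is not UTF-8 in the first place.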
-
Hello, first, thanks for your time! Actually I started by using a StreamReader, with the result that the complete line was there but without the '²' character. Any other suggestions? All the best, Martin
-
First of all, check the source file (in my experience the most likely error). The character should be encoded with the byte sequence C2 B2. As a quick hack to test it, rename the input file so it has a .bin extension and open it in Visual Studio; you should then see the byte content in hex.
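If renaming the file is inconvenient, the same check can be done in code. This is only a sketch; `HexCheck` and its methods are hypothetical helpers, not part of the framework. It prints bytes in the "C2-B2" style that `BitConverter.ToString` produces.

```csharp
using System;
using System.IO;
using System.Text;

public static class HexCheck
{
    // Hex form of the UTF-8 encoding of a string, e.g. to see what
    // a correctly encoded character should look like in the file.
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    // Hex form of the first maxBytes bytes of a file.
    // A single Read call may return fewer bytes than requested;
    // that is good enough for a quick look.
    public static string FileHex(string path, int maxBytes)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            byte[] buffer = new byte[maxBytes];
            int read = fs.Read(buffer, 0, buffer.Length);
            return BitConverter.ToString(buffer, 0, read);
        }
    }
}
```

`HexCheck.Utf8Hex("²")` gives "C2-B2". If `FileHex` shows a lone B2 where the '²' should be, the file is not valid UTF-8 at that position.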
-
Hmm, I did the test with the following result: "B2", and no character is shown. When I open it in Total Commander's binary view, it's also "B2", but the '²' is shown. So is UTF-8 really always multibyte, or only for special characters? Thanks again for your time and patience. All the best, Martin
-
Hello, it really looks like the file isn't actually UTF-8. (It's an XML file with a declaration that says it's UTF-8 encoded :confused:) But I don't know which encoding I have to use. Is there a way to find this out? All the best, Martin
-
Yes, I have seen this before: it's an invalid XML file, and a lot of programs spit those out. :( A fair guess is the local ANSI code page of wherever the file was saved. For most Western European languages, this would be Windows code page 1252 (or use Encoding.Default if the code page is the same on the machine reading the file as on the one that wrote it).
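One way to act on that guess, as a rough sketch: try strict UTF-8 first and only fall back to the guessed code page when the bytes are not valid UTF-8. `EncodingFallback.Decode` is a made-up helper, and the fallback encoding is passed in by the caller because which code page is right remains a guess.

```csharp
using System.Text;

public static class EncodingFallback
{
    // Tries strict UTF-8 first; if the data contains invalid byte
    // sequences (such as a lone B2), decodes it with the fallback instead.
    public static string Decode(byte[] data, Encoding fallback)
    {
        // Second argument true: throw on invalid bytes instead of
        // silently replacing them.
        UTF8Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(data);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(data);
        }
    }
}
```

For the case in this thread the call would be something like `EncodingFallback.Decode(bytes, Encoding.GetEncoding(1252))`. On .NET Framework that works as is; on .NET Core and later, code page 1252 is only available after `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`, and `Encoding.Default` there is always UTF-8, so it is no longer a stand-in for the ANSI code page.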
-
UTF-8 uses more than one byte for anything above Unicode code point 127. Some refer to these as special; others refer to English characters as special. :)
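A small sketch to make that concrete; `Utf8Lengths` is a hypothetical helper. `ByteCount` asks the framework how many bytes a string needs in UTF-8, and `SequenceLength` reads the length of a sequence off its first byte. It is a simplification that does not reject lead bytes which are never valid, such as C0, C1, or F5 and above.

```csharp
using System.Text;

public static class Utf8Lengths
{
    // Number of bytes UTF-8 needs to encode the given text.
    public static int ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Length of the UTF-8 sequence that starts with this byte,
    // or 0 if the byte cannot start a sequence.
    public static int SequenceLength(byte first)
    {
        if ((first & 0x80) == 0x00) return 1; // 0xxxxxxx: code points 0..127
        if ((first & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((first & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((first & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0; // 10xxxxxx is a continuation byte, never a first byte
    }
}
```

`ByteCount("A")` is 1 and `ByteCount("²")` is 2. `SequenceLength(0xC2)` is 2, while `SequenceLength(0xB2)` is 0, which is why a B2 on its own cannot be a character in UTF-8.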