UTF8Encoding question

Martin 0

Hello everybody, Basically the problem is that UTF8Encoding does not seem to be able to decode the character '²'. Is this a known issue, or am I doing something wrong? This is what I do in a test routine:

    using (FileStream fs = File.OpenRead(filename))
    {
        byte[] b = new byte[10000];
        UTF8Encoding temp = new UTF8Encoding(false, true);
        int l = 0;

        FileStream fw = File.Open("copy.txt", System.IO.FileMode.Create);

        while ((l = fs.Read(b, 0, b.Length)) > 0)
        {
            string s = temp.GetString(b, 0, l);
            Console.WriteLine(s);
            //...
        }
        fw.Close();
    }

Thanks for your time and help. All the best, Martin

lmoelleb

Can your console output the character even if it is read correctly? If your console code page does not support it, your program won't display it. UTF-8 is a multibyte encoding, so reading the file in fixed-size chunks might split the byte stream in the middle of a character. Create a StreamReader with Encoding.UTF8 to avoid having to deal with this manually. Edit: Are you sure your file is actually UTF-8 encoded? :)
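A minimal sketch of the StreamReader suggestion, in case it helps. The class name `Utf8ReadSketch`, the helper `ReadAllUtf8`, and the fallback path `input.txt` are made up for illustration. Setting `Console.OutputEncoding` is an assumption about how to address the console code page; whether the glyph actually shows up still depends on the console font.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8ReadSketch
{
    // Reads a whole stream as UTF-8 text. StreamReader keeps incomplete
    // multi-byte sequences in its internal buffer until the remaining
    // bytes arrive, so a character is not cut in half between reads.
    public static string ReadAllUtf8(Stream input)
    {
        using (var reader = new StreamReader(input, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main(string[] args)
    {
        // "input.txt" is a placeholder path, not taken from the thread.
        string path = args.Length > 0 ? args[0] : "input.txt";

        // Ask the console to emit UTF-8 so characters outside the
        // default code page have a chance of being displayed.
        Console.OutputEncoding = Encoding.UTF8;

        using (FileStream fs = File.OpenRead(path))
        {
            Console.WriteLine(ReadAllUtf8(fs));
        }
    }
}
```

If the file is not really UTF-8, this will still return text, but invalid bytes are replaced rather than decoded, which matches the "line is there but the character is missing" symptom.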

Martin 0

Hello, First, thanks for your time! Actually I started by using a StreamReader, with the result that the complete line was there but without the '²' character. Any other suggestions? All the best, Martin

lmoelleb

First of all, check the source file (in my experience the most likely source of the error). In UTF-8 the character should be encoded as the byte sequence C2 B2. As a quick hack to test it, rename the input file so it has a .bin extension and load it in Visual Studio; you should then see the byte content in hex format.
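The same check can also be done in code instead of a hex viewer. This is only a sketch: `HexCheckSketch`, `Utf8Hex` and `ContainsUtf8SuperscriptTwo` are hypothetical names, and `input.txt` is a placeholder path.

```csharp
using System;
using System.IO;
using System.Text;

public static class HexCheckSketch
{
    // Returns the UTF-8 bytes of a string as dash-separated hex.
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    // True if the buffer contains the two-byte sequence C2 B2,
    // which is how UTF-8 encodes U+00B2 (superscript two).
    public static bool ContainsUtf8SuperscriptTwo(byte[] data)
    {
        for (int i = 0; i + 1 < data.Length; i++)
        {
            if (data[i] == 0xC2 && data[i + 1] == 0xB2)
            {
                return true;
            }
        }
        return false;
    }

    public static void Main(string[] args)
    {
        // What a valid UTF-8 file should contain for this character.
        Console.WriteLine(Utf8Hex("\u00B2")); // prints "C2-B2"

        // "input.txt" is a placeholder path, not taken from the thread.
        string path = args.Length > 0 ? args[0] : "input.txt";
        byte[] data = File.ReadAllBytes(path);
        Console.WriteLine(ContainsUtf8SuperscriptTwo(data)
            ? "Found C2 B2: the character is stored as UTF-8."
            : "No C2 B2 found: the character is missing or stored in another encoding.");
    }
}
```

A lone B2 byte without a preceding C2 is not valid UTF-8, which points to a single-byte encoding.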

Martin 0

Hmm, I did the test with the following result: the byte is "B2" and no character is shown. When I open the file in Total Commander's binary view, it's also "B2", but the '²' is shown. So does UTF-8 really use multiple bytes for every character, or only for special characters? Thanks again for your time and patience. All the best, Martin

Martin 0

Hello, It really looks like the file isn't actually UTF-8. (It's an XML file with a declaration that says it's UTF-8 encoded :confused:) But I don't know which encoding I have to use. Is there a way to find this out? All the best, Martin

lmoelleb

Yes, I have seen this before: it's an invalid XML file, and a lot of programs spit those out. :( A fair guess is the local ANSI code page of wherever the file was saved. For most Western European languages, this would be Windows code page 1252 (or use Encoding.Default if the code page is the same on the machine reading the file as on the one that wrote it).
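A rough sketch of decoding with code page 1252 instead of UTF-8. It assumes a modern .NET runtime (.NET Core 3.0 or later), where code page encodings have to be registered through `CodePagesEncodingProvider` before `Encoding.GetEncoding(1252)` works; on the classic .NET Framework the registration line should not be needed. `Cp1252Sketch` and `DecodeWindows1252` are hypothetical names.

```csharp
using System;
using System.Text;

public static class Cp1252Sketch
{
    // Decodes bytes as Windows-1252, where every byte maps to one character.
    public static string DecodeWindows1252(byte[] data)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code
        // pages are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252).GetString(data);
    }

    public static void Main()
    {
        // 'm' followed by a lone B2 byte, as seen in the hex view.
        byte[] data = { 0x6D, 0xB2 };

        // B2 alone is invalid UTF-8, so Encoding.UTF8 substitutes U+FFFD.
        string asUtf8 = Encoding.UTF8.GetString(data);
        Console.WriteLine(asUtf8 == "m\uFFFD"); // prints "True"

        // The same bytes decode to 'm' + superscript two in code page 1252.
        string as1252 = DecodeWindows1252(data);
        Console.WriteLine(as1252 == "m\u00B2"); // prints "True"
    }
}
```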

lmoelleb

UTF-8 uses more than one byte for anything above Unicode character 127. Some people refer to these as special; others refer to the English characters as special. :)
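To see the byte counts for yourself, something like the following should work. `Utf8LengthSketch` and `Utf8ByteCount` are names invented for this sketch; the sample characters are chosen only to sit on either side of the 127 boundary.

```csharp
using System;
using System.Text;

public static class Utf8LengthSketch
{
    // Number of bytes UTF-8 needs to encode the given text.
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        Console.WriteLine(Utf8ByteCount("A"));      // U+0041, prints 1
        Console.WriteLine(Utf8ByteCount("\u007F")); // U+007F (127), prints 1
        Console.WriteLine(Utf8ByteCount("\u0080")); // U+0080 (128), prints 2
        Console.WriteLine(Utf8ByteCount("\u00B2")); // superscript two, prints 2
        Console.WriteLine(Utf8ByteCount("\u20AC")); // euro sign, prints 3
    }
}
```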

Martin 0

Thanks for the info!

Martin 0

:-D Bullseye! Thanks very much! I'm now using the GetEncoding method with "Windows-1252" as the parameter. All the best, Martin