UTF8Encoding question

    Martin 0 wrote (#1):

    Hello everybody, my problem is that UTF8Encoding does not decode the character '²'. Is this a known limitation, or am I doing something wrong? This is what I do in a test routine:

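        // Requires: using System; using System.IO; using System.Text;
        using (FileStream fs = File.OpenRead(filename))
        using (FileStream fw = File.Open("copy.txt", FileMode.Create))
        {
            byte[] b = new byte[10000];
            // No BOM on output, throw on invalid byte sequences.
            UTF8Encoding temp = new UTF8Encoding(false, true);
            int l = 0;

            // Read the file in chunks and decode each chunk as UTF-8.
            while ((l = fs.Read(b, 0, b.Length)) > 0)
            {
                string s = temp.GetString(b, 0, l);
                Console.WriteLine(s);
                //...
            }
        }
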

    Thanks for your time and help. All the best, Martin

      lmoelleb wrote (#2, in reply to #1):

      Can your console output the character even if it is read correctly? If your console code page does not support it, your program won't display it. UTF-8 is a multi-byte encoding, so your read operation might split the byte stream in the middle of a character. Create a StreamReader with Encoding.UTF8 to avoid having to deal with this manually. Edit: Are you sure your file is actually UTF-8 encoded? :)
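
A minimal sketch of the split-character issue mentioned above, assuming the bytes arrive in fixed-size chunks as in the original read loop. The class and method names (`ChunkedUtf8Demo`, `DecodeInChunks`) are made up for illustration. It uses a `Decoder`, which keeps the trailing bytes of an incomplete character between calls; `StreamReader` takes care of the same thing internally.

```csharp
using System;
using System.Text;

public static class ChunkedUtf8Demo
{
    // Decodes UTF-8 bytes that are processed in chunks of chunkSize bytes.
    // The Decoder remembers an incomplete multi-byte sequence at the end
    // of one chunk and completes it with the start of the next chunk.
    public static string DecodeInChunks(byte[] data, int chunkSize)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var result = new StringBuilder();
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(chunkSize)];

        for (int offset = 0; offset < data.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, data.Length - offset);
            int written = decoder.GetChars(data, offset, count, chars, 0);
            result.Append(chars, 0, written);
        }
        return result.ToString();
    }

    public static void Main()
    {
        // "\u00B2" is '²'; in UTF-8 it is the two bytes C2 B2.
        byte[] data = Encoding.UTF8.GetBytes("m\u00B2 and more");

        // A chunk size of 2 puts C2 and B2 into different chunks.
        string decoded = DecodeInChunks(data, 2);
        Console.WriteLine(decoded == "m\u00B2 and more"); // True
    }
}
```

Calling `GetString` on each chunk separately would instead treat the lone C2 and the lone B2 as two invalid sequences.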

        Martin 0 wrote (#3, in reply to #2):

        Hello, first, thanks for your time! Actually, I started by using a StreamReader, with the result that the complete line was there but without the '²' character. Any other suggestions? All the best, Martin

          lmoelleb wrote (#4, in reply to #3):

          First of all, check the source file (in my experience the most likely source of the error). In UTF-8 the character should be encoded as the byte sequence C2 B2. As a quick hack to test it, rename the input file so it has a .bin extension and load it in Visual Studio; you should then see the byte content in hex format.
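
The same check can be sketched in code, assuming you have the file's bytes in memory. The names `Utf8Check` and `IsValidUtf8` are hypothetical. The strict `UTF8Encoding` (with `throwOnInvalidBytes` set to true) throws on malformed input instead of silently substituting a replacement character.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Returns true when the bytes form well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes decoding fail loudly
        // instead of replacing bad sequences with U+FFFD.
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // '²' is U+00B2; its UTF-8 encoding is C2 B2.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u00B2"))); // C2-B2

        Console.WriteLine(IsValidUtf8(new byte[] { 0x6D, 0xC2, 0xB2 })); // True
        // A lone B2 is a continuation byte without a lead byte.
        Console.WriteLine(IsValidUtf8(new byte[] { 0x6D, 0xB2 }));       // False
    }
}
```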

            Martin 0 wrote (#5, in reply to #4):

            Hmm, I did the test with the following result: the byte is "B2" and no character is shown. When I open the file in Total Commander's binary view, it is also "B2", but the '²' is shown. So is UTF-8 really multi-byte everywhere, or only for special characters? Thanks again for your time and patience. All the best, Martin

              Martin 0 wrote (#6, in reply to #4):

              Hello, it really looks like the file isn't actually UTF-8. (It's an XML file with a declaration that says it is UTF-8 encoded :confused:) But I don't know which encoding I have to use. Is there a way to find this out? All the best, Martin

                lmoelleb wrote (#7, in reply to #6):

                Yes, this has been seen before: it's an invalid XML file, and a lot of programs spit those out. :( A fair guess is the local ANSI code page of wherever the file was saved. For most Western European languages, this would be Windows code page 1252 (or use Encoding.Default if the code page is the same on the machines reading and writing the file).
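
One way to act on that guess, sketched under some assumptions: try strict UTF-8 first and fall back to Windows-1252 only when the bytes are not valid UTF-8. The names `LegacyFallback` and `DecodeUtf8OrWindows1252` are hypothetical, and the fallback code page is still a guess, not a detection. The `RegisterProvider` call is needed on .NET Core and .NET 5+, where code page 1252 is not available by default; on those runtimes `Encoding.Default` is also always UTF-8, so it cannot stand in for the ANSI code page there.

```csharp
using System;
using System.Text;

public static class LegacyFallback
{
    private static readonly Encoding Windows1252 = CreateWindows1252();

    private static Encoding CreateWindows1252()
    {
        // Makes legacy code pages such as 1252 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Decodes as UTF-8 when the bytes are valid UTF-8,
    // otherwise assumes Windows-1252 (an assumption, not a detection).
    public static string DecodeUtf8OrWindows1252(byte[] bytes)
    {
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return Windows1252.GetString(bytes);
        }
    }

    public static void Main()
    {
        // "m²" saved as Windows-1252: '²' is the single byte B2.
        byte[] legacy = { 0x6D, 0xB2 };
        // "m²" saved as UTF-8: '²' is the two bytes C2 B2.
        byte[] utf8 = { 0x6D, 0xC2, 0xB2 };

        Console.WriteLine(DecodeUtf8OrWindows1252(legacy) == "m\u00B2"); // True
        Console.WriteLine(DecodeUtf8OrWindows1252(utf8) == "m\u00B2");   // True
    }
}
```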

                  lmoelleb wrote (#8, in reply to #5):

                  UTF-8 uses more than one byte for anything above Unicode code point 127. Some refer to these as special; others refer to English characters as special. :)
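
A small sketch that illustrates the byte counts; the helper name `Utf8ByteCount` is made up for this example.

```csharp
using System;
using System.Text;

public static class Utf8Lengths
{
    // Number of bytes the text occupies when encoded as UTF-8.
    public static int Utf8ByteCount(string text) => Encoding.UTF8.GetByteCount(text);

    public static void Main()
    {
        Console.WriteLine(Utf8ByteCount("A"));      // 1 (U+0041, ASCII)
        Console.WriteLine(Utf8ByteCount("\u007F")); // 1 (U+007F, last single-byte code point)
        Console.WriteLine(Utf8ByteCount("\u00B2")); // 2 (U+00B2, '²')
        Console.WriteLine(Utf8ByteCount("\u20AC")); // 3 (U+20AC, euro sign)
    }
}
```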

                    Martin 0 wrote (#9, in reply to #8):

                    Thanks for the info!

                      Martin 0 wrote (#10, in reply to #7):

                      :-D Bull's eye! Thanks very much! I'm now using the GetEncoding method with "Windows-1252" as the parameter. All the best, Martin
