Character Encoding

    gUrM33T wrote (#1):

    Do I need to know the encoding of a file before opening it, or does the framework determine it automatically? If the .NET Framework cannot determine it automatically, is there any way I can find it out myself? I'm not clear on how character encoding works, so a detailed explanation would be a great help. Thanks, Gurmeet


    BTW, can Google help me search my lost pajamas?

    My Articles: HTML Reader C++ Class Library, Numeric Edit Control

      Mike Dimmick wrote (#2):

      StreamReader has several constructors, some of which take an Encoding and/or a Boolean flag that says whether the encoding should be detected. A bit of poking around in Reflector reveals that if you don't provide an encoding, it uses UTF-8, and unless you say otherwise, it first tries to detect the encoding rather than going straight to that default.

      Detection relies on the Byte Order Mark (BOM). The Unicode standard allows this character, U+FEFF, to appear at the very beginning of the text in whatever encoding is used. In UTF-16 little-endian it becomes the byte sequence 0xFF 0xFE; in UTF-8 it is 0xEF 0xBB 0xBF. .NET can also detect UTF-16BE (big-endian), where the two bytes of each UTF-16 code unit are the other way round (0xFE 0xFF). If there's no Byte Order Mark, StreamReader simply uses the encoding passed to the constructor, or UTF-8 if you used a constructor that doesn't take one. If you use File.OpenText or FileInfo.OpenText, you don't get to specify an encoding.

      Unfortunately very few of us have files encoded as UTF-8. They're far more likely to be encoded using our default code page. For most Western European and North American users, this is going to be Windows 1252 (Windows Western). You can get hold of an encoding for the user's configured ANSI code page using Encoding.Default (this holds on the .NET Framework; on .NET Core and later, Encoding.Default returns UTF-8 instead).

      Western users, particularly in the UK, US and Canada, may not notice at first that the wrong encoding is being used, because the first 256 code points of Unicode are the same as ISO Latin 1 (which differs a little, though not a lot, from 1252), and because of the way UTF-8 is designed, the first 128 code points are encoded as the same single bytes as in ASCII (ISO-646-US) and Latin 1. Any byte greater than 127 in UTF-8 is part of a multi-byte sequence: a lead byte says how many continuation bytes follow, and all of them must be interpreted together to get the full character.

      There's no reliable way to detect which encoding was used for an arbitrary sample of text in a byte-oriented encoding other than UTF-8. The concept of Byte Order Marks is relatively new, so older files won't have one. You either have to know the encoding or ask the user.

      More information: Microsoft Global Development Portal, Code Page reference tables.
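
A minimal sketch of the constructor behaviour described above, assuming the three-argument `StreamReader(string, Encoding, bool)` overload. The class and method names (`EncodingDemo`, `ReadAllText`) are hypothetical, chosen only for illustration. It asks StreamReader to honour a BOM if one is present and to fall back to a caller-supplied encoding otherwise, then reports which encoding was actually used.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: reads a whole file, preferring a BOM if present
    // and decoding with 'fallback' when the file has no BOM.
    public static string ReadAllText(string path, Encoding fallback, out Encoding used)
    {
        using (var reader = new StreamReader(path, fallback, detectEncodingFromByteOrderMarks: true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding only reflects BOM detection after the first read.
            used = reader.CurrentEncoding;
            return text;
        }
    }
}
```

On the .NET Framework, passing `Encoding.Default` as the fallback gives the "BOM first, otherwise the user's ANSI code page" behaviour discussed in the post. A fixed fallback such as `Encoding.GetEncoding("iso-8859-1")` behaves the same way on every runtime.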


        Paul Watson wrote (#3):

        Great information, thanks Mike.

        Regards,
        Paul Watson
        Bluegrass
        South Africa

        Chris Maunder wrote: "I'd rather cover myself in honey and lie on an ant's nest than commit myself to it publicly." Jon Sagara replied: "I think we've all been in that situation before."

        Crikey! ain't life grand?


          Heath Stewart wrote (#4):

          Great information! I just wanted to add that BOMs (byte order marks) aren't always present in a text file. There's no requirement for BOMs. While there's no reliable way to detect the encoding, as you said, web browsers and other applications (like Word) do try to detect it.

          If you (the original poster) need to do something like that, a simple way (though probably not the most efficient) is to take a random sampling of strings within the text file, call StringInfo.GetTextElementEnumerator, and enumerate the text elements. With either all of those or a random sample, call TextElementEnumerator.GetTextElement (which returns a String) and check the Length. Bear in mind that Length counts UTF-16 code units in the already-decoded string, not bytes in the file, so a length greater than one means a surrogate pair or a combining sequence, and the check only tells you something once the sample has been decoded with a candidate encoding.

          In terms of the raw bytes, the rough rules are these. If characters take a varying number of bytes, you're dealing with a multi-byte encoding such as UTF-8 (or a legacy multi-byte character set, MBCS). If every character takes two bytes, it's likely UTF-16 (which needs four bytes for characters outside the Basic Multilingual Plane; there's also UTF-32, which always uses four bytes per character). If every byte is below 128, you've probably got ASCII, or a single-byte encoding whose text happens to be all ASCII. From there you can make certain assumptions. You see browsers doing this when they start displaying question marks for characters in odd places (this also happens when the specified encoding is wrong).
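
A rough sketch of one way to guess from the raw bytes. This is not the text-element approach from the post but a swapped-in technique: check for the BOMs mentioned earlier in the thread, then test whether the bytes are valid UTF-8 using a strict decoder, and otherwise give up and return a caller-supplied fallback. The names `EncodingGuess`, `Guess`, and `StartsWith` are hypothetical. The result is only a guess: text in a single-byte code page that contains nothing but ASCII will be reported as UTF-8 (harmless, since the bytes decode identically), and a sample cut off in the middle of a multi-byte character will fail the UTF-8 check.

```csharp
using System;
using System.Text;

static class EncodingGuess
{
    // Hypothetical helper: returns a best guess for the encoding of 'bytes'.
    // The final fallback is an assumption supplied by the caller, not a detection.
    public static Encoding Guess(byte[] bytes, Encoding fallback)
    {
        // 1. Byte order marks. The 4-byte UTF-32 LE mark is tested before the
        //    2-byte UTF-16 LE mark because it starts with the same two bytes.
        if (StartsWith(bytes, 0xEF, 0xBB, 0xBF)) return Encoding.UTF8;
        if (StartsWith(bytes, 0xFF, 0xFE, 0x00, 0x00)) return Encoding.UTF32;
        if (StartsWith(bytes, 0xFF, 0xFE)) return Encoding.Unicode;          // UTF-16 LE
        if (StartsWith(bytes, 0xFE, 0xFF)) return Encoding.BigEndianUnicode; // UTF-16 BE

        // 2. No BOM: a strict UTF-8 decoder throws on any invalid byte sequence,
        //    so a clean pass is strong evidence that the data is UTF-8.
        try
        {
            new UTF8Encoding(false, true).GetCharCount(bytes);
            return Encoding.UTF8;
        }
        catch (ArgumentException)
        {
            // 3. Not valid UTF-8: we cannot tell which code page it is.
            return fallback;
        }
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Typical use would be `Encoding enc = EncodingGuess.Guess(File.ReadAllBytes(path), Encoding.Default);` followed by `enc.GetString(...)`, or simply asking the user when the fallback branch is taken.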

          Microsoft MVP, Visual C# My Articles
