Code Project
How to get a string's encoding

C#
#1 Matt Gerrans wrote:

Given a particular string, is there a way to tell whether it contains Unicode characters? I know I could test every character's range, but I'm wondering if there is some API call for this. I've looked in String, System.Text.Encoding, and System.Globalization, but haven't found any likely suspects yet... Matt Gerrans


#2 Heath Stewart wrote:

Strings in .NET are stored as Unicode. The encoding only matters when reading from and writing to streams (text files, network streams, etc.). This posting is provided "AS IS" with no warranties, and confers no rights. Software Design Engineer, Developer Division Sustained Engineering, Microsoft [My Articles] [My Blog]

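Heath's point can be seen directly with Encoding.GetByteCount: the same string yields different byte counts under different encodings, because the encoding is a property of the serialized bytes, not of the string itself. (A minimal sketch; the class name is mine.)

```csharp
using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        string s = "héllo"; // five characters, one of them non-ASCII

        // UTF-16 stores every character here as 2 bytes.
        Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 10

        // UTF-8 stores the four ASCII characters as 1 byte each
        // and "é" as a 2-byte sequence.
        Console.WriteLine(Encoding.UTF8.GetByteCount(s));    // 6
    }
}
```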

#3 Matt Gerrans wrote:

I know that. I'm adding lines to existing files. If the existing file is ASCII and I'm adding regular ASCII text, then everything's fine. However, if the line I'm adding has some Unicode characters (which may be the case), I want to change the file's encoding to Unicode and rewrite the whole thing. (If the existing file is already Unicode, it's easy, of course, except for this StreamReader bug that tells you it is UTF8.) I've already figured out how to do this by going through all the characters and checking for any outside the 0-255 range, but I was wondering if there was an API call, or a more idiomatic way of handling this. Matt Gerrans

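The per-character scan Matt describes can be written compactly (a sketch; the class and method names are mine, and the 0-255 cutoff mirrors his check):

```csharp
using System.Linq;

static class EncodingProbe
{
    // True if any UTF-16 code unit falls outside 0-255, i.e. the text
    // cannot survive a single-byte encoding and the file should be
    // rewritten as Unicode.
    public static bool NeedsUnicode(string s)
    {
        return s.Any(c => c > 255);
    }
}
```

For example, `EncodingProbe.NeedsUnicode("héllo")` is false ('é' is code unit 233), while `EncodingProbe.NeedsUnicode("日本")` is true.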

#4 Heath Stewart wrote:

And why do you call it a bug? It sounds correct. UTF8 is an MBCS (multi-byte character set): characters in the 7-bit ASCII range are stored as single bytes, exactly as ANSI stores them, while bytes with the 8th bit set begin multi-byte sequences encoding higher Unicode code points. That's the beauty of UTF8: it maintains backward compatibility so long as you don't use Unicode, and if you must, it allows for that. So, use the UTF8Encoding instead.

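The backward compatibility Heath describes is easy to demonstrate: pure ASCII text encodes to identical bytes under both encoders, while a character above U+007F becomes a multi-byte sequence. (A sketch; the class name is mine.)

```csharp
using System;
using System.Linq;
using System.Text;

class Utf8Demo
{
    static void Main()
    {
        // Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
        byte[] a = Encoding.ASCII.GetBytes("hello");
        byte[] u = Encoding.UTF8.GetBytes("hello");
        Console.WriteLine(a.SequenceEqual(u));       // True

        // A code point above U+007F becomes a multi-byte sequence.
        byte[] e = Encoding.UTF8.GetBytes("\u00E9"); // é
        Console.WriteLine(BitConverter.ToString(e)); // C3-A9
    }
}
```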

#5 Matt Gerrans wrote:

It is incorrect if the file's BOM ("\xFF\xFE") says it is Unicode (UTF-16) and the reader decides it is UTF8 (whose BOM is "\xEF\xBB\xBF"). So I call it a bug, because I think it is one. I've since noticed that if I use StreamReader's string constructor, it correctly identifies the file as Unicode, but if I use the FileStream constructor it misidentifies it as UTF8. So if I do this with a Unicode-encoded file:

            void SomeMethod( FileInfo info )
            {
                StreamReader reader = new StreamReader( info.OpenRead() );
                System.Text.Encoding encoding = reader.CurrentEncoding; // UTF8!?
                string data = reader.ReadToEnd();
                reader.Close();

                data = Massage(data);

                StreamWriter writer = new StreamWriter( info.OpenWrite(), encoding );
                writer.Write(data);
                writer.Close();
            }

            I get a UTF8 encoded file as a result, which I don't want. On the other hand, if I do this:

            void SomeMethod( FileInfo info )
            {
                // Use the filename instead of OpenRead():
                StreamReader reader = new StreamReader( info.FullName );
                System.Text.Encoding encoding = reader.CurrentEncoding; // Unicode!
                string data = reader.ReadToEnd();
                reader.Close();

                data = Massage(data);

                StreamWriter writer = new StreamWriter( info.OpenWrite(), encoding );
                writer.Write(data);
                writer.Close();
            }

The file will be Unicode, as expected (and desired). Maybe the intermediate use of the FileStream causes the loss of the encoding?

Because these files are used by multiple platforms and programming languages (not all of which support MBCS), I want to use only ASCII or Unicode, not UTF8 (or UTF7, or Unicode big-endian, etc.). I don't see that much beauty in UTF8 (the "backward compatibility" also gets hosed by extended-ASCII characters, which usually come from "backward" text files that used drawing characters and the like), just unnecessary complexity, especially in these days of multi-gigabyte storage.

By the way, the original question was about detecting the presence of Unicode characters in a string (which, having 16-bit characters, could contain some, or not); this matters when the original file was ASCII but a line containing Unicode characters is inserted into it. In that case, I just want to switch the whole file over to Unicode.

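One likely explanation for the discrepancy, at least on later versions of .NET: the reader only inspects the BOM when it performs its first read, so CurrentEncoding reflects the detected encoding after ReadToEnd(), not before it. A sketch (the class name and temp-file usage are mine):

```csharp
using System;
using System.IO;
using System.Text;

class BomProbe
{
    static void Main()
    {
        string path = Path.GetTempFileName();
        // Encoding.Unicode (UTF-16 LE) writes the FF FE byte order mark.
        File.WriteAllText(path, "data", Encoding.Unicode);

        using (StreamReader reader = new StreamReader(
            File.OpenRead(path), Encoding.UTF8, true)) // true = detect BOM
        {
            string data = reader.ReadToEnd();
            // BOM detection happens during the first read, so query
            // CurrentEncoding only after reading.
            Console.WriteLine(reader.CurrentEncoding.WebName); // utf-16
        }
        File.Delete(path);
    }
}
```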

#6 Heath Stewart wrote:

You should use StreamReader reader = info.OpenText();. If you look at the IL, the code for opening a StreamReader from a stream vs. a filename is the same: the constructor that takes a string does the same thing you are doing - it opens a FileStream and passes it to Init (an internal method that every constructor eventually calls). As for your original question, see the StringInfo class defined in the System.Globalization namespace. It lets you enumerate text elements (each derived from one or more code points), from which you can determine what a string actually contains.

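A sketch of the StringInfo approach Heath mentions (the demo class name is mine): it enumerates text elements, so a base character plus a combining mark counts as one element even though it occupies two UTF-16 code units.

```csharp
using System;
using System.Globalization;

class StringInfoDemo
{
    static void Main()
    {
        // "e" followed by a combining acute accent: two UTF-16 code units,
        // but a single user-perceived character (text element).
        string s = "e\u0301";

        TextElementEnumerator en = StringInfo.GetTextElementEnumerator(s);
        int elements = 0;
        while (en.MoveNext())
            elements++;

        Console.WriteLine(s.Length);  // 2 code units
        Console.WriteLine(elements);  // 1 text element
    }
}
```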