Converting a Unicode value (with a surrogate pair) to a character in C#

vnmatt wrote:

Here you can see the character: http://www.cojak.org/index.php?function=code_lookup&term=2A601
I'm not sure what you're on about with the 20-bits thing, but if you have a way for me to turn U+2A601 into that character in C#, that would be great. That's all I am after. :-)

Lost User (Richard MacCutchan) wrote (#4):

gordon_matt wrote:

I'm not sure what you're on about with the 20-bits thing

Hex value 2A601 is five digits, which is 20 bits; Unicode characters are only 16 bits. I'm not sure what translation table is being used in the link you posted, but I would suggest an email to the person who made the website as the best step forward.

Just say 'NO' to evaluated arguments for dyadic functions! Ash


jschell wrote (#5):

Richard MacCutchan wrote:

Hex value 2A601 is five digits, which is 20 bits; Unicode characters are only 16 bits

In what context? The Unicode standard defines more than one encoding form; the number of bits alone has nothing to do with whether a character is valid. Each form is defined by a minimum representational size called a code unit, which can be 8, 16, or 32 bits, and the full range of Unicode characters is represented by one or more code units in each form. The OP is asking about surrogate pairs in the 16-bit form, UTF-16. A suitable description is found here (one can find it at unicode.org as well, but it requires digging): http://en.wikipedia.org/wiki/UTF-16/UCS-2

And C# certainly supports surrogate pairs. The following references the upper limit of 10FFFF, which would be 21 bits, and 2A601 is certainly in that range: http://msdn.microsoft.com/en-us/library/aa664669(v=vs.71).aspx

Perhaps you are stating that 2A601 is not a valid character in the character set?
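To make that concrete, here is a minimal sketch (standard .NET System APIs only): a string holding U+2A601 occupies two UTF-16 code units, a high surrogate and a low surrogate.

    using System;

    class SurrogateDemo
    {
        static void Main()
        {
            // U+2A601 written with the 8-digit \U escape; it needs two chars.
            string s = "\U0002A601";

            Console.WriteLine(s.Length);                   // 2 (two UTF-16 code units)
            Console.WriteLine(char.IsHighSurrogate(s[0])); // True (s[0] == '\uD869')
            Console.WriteLine(char.IsLowSurrogate(s[1]));  // True (s[1] == '\uDE01')

            // Recombining the pair yields the scalar value back.
            int codePoint = char.ConvertToUtf32(s, 0);
            Console.WriteLine(codePoint.ToString("X"));    // 2A601
        }
    }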

vnmatt wrote (the original question):

Hi guys, I want to know how to get the correct character from a Unicode value which has a surrogate pair. Here is my code, which works with most characters:

    public static char ConvertUnicodeToCharacter(string unicodeValue)
    {
        int unicode = int.Parse(unicodeValue.Substring(2), NumberStyles.HexNumber);
        return (char)unicode;
    }

This will take in something like U+3400 and spit out the character (in this case: 㐀). When I try this one, U+2A601, I always get this: ꘁ, but it should be this: 𪘁.

Info on surrogate pairs: "UTF-16 Encoding Form. The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair" - http://www.unicode.org/glossary/

I just want to convert U+2A601 to 𪘁. Anyone know how? Thanks

jschell wrote (#6):

gordon_matt wrote:

I want to know how to get the correct character from a Unicode value which has a surrogate pair.

The data type 'char' only has 16 bits. For C# the full range of Unicode characters extends to 10FFFF, which will not fit in a 'char'. A 'char' represents a single UTF-16 code unit, not a character from the full Unicode range. A surrogate pair is by definition two code units, so to represent one you must have two 'char' values. Alternatives would be to use an 'int' or a String.
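A sketch of the failure mode described here, assuming C#'s default unchecked context: casting the int 0x2A601 to a 16-bit char silently keeps only the low 16 bits, 0xA601, which is exactly the wrong character the OP reported.

    using System;

    class CastTruncationDemo
    {
        static void Main()
        {
            int codePoint = 0x2A601;

            // Unchecked narrowing conversion: only the low 16 bits survive.
            char truncated = (char)codePoint;
            Console.WriteLine(((int)truncated).ToString("X")); // A601
            Console.WriteLine(truncated);                      // ꘁ (U+A601, not U+2A601)

            // The supplementary character needs two chars, i.e. a string.
            string full = char.ConvertFromUtf32(codePoint);
            Console.WriteLine(full.Length);                    // 2
            Console.WriteLine(full);                           // 𪘁
        }
    }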


vnmatt wrote (#7):

Thanks for the reply. I understand this well enough, but maybe everyone here is not quite understanding what I want. Okay, don't return a System.Char; give me a System.String, whatever. I just want to do something like this:

    public string GetUnicodeCharacter(string uniValue)
    {
        //uniValue = "U+2A601"
        //Do some logic
        return myUnicodeCharacter; //string that will display 𪘁
    }

It seems nobody knows how to get this and it is driving me nuts. Yes, I did email the author of that site (cojak.org), but no reply yet; maybe because of holidays, but I'll be lucky if he helps me anyway (not to mention his site is PHP, so he might not be able to help with .NET). Maybe there is a way to get the surrogate values from the "U+2A601" string? Then I could do something (ugly as it may be) like this:

    public string myImaginaryMethod()
    {
        string original = "\u2A601";
        string highSurrogate = GetHighSurrogate(original); // returns "\uD869"
        string lowSurrogate = GetLowSurrogate(original);   // returns "\uDE01"
        return highSurrogate + lowSurrogate; // returns 𪘁
    }

Surely someone must know how to get this value somehow. I don't care how, so long as I can get 𪘁 from "U+2A601" (or from "\u2A601" would be fine too). :-)
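For what it's worth, the two helpers imagined above can be sketched in a few lines. GetHighSurrogate and GetLowSurrogate are the poster's hypothetical names, not framework methods, and this sketch assumes the input uses the "U+XXXXX" notation rather than the "\u2A601" literal:

    using System.Globalization;

    static class SurrogateHelpers
    {
        // Parse "U+2A601"-style notation into a scalar value.
        static int ParseCodePoint(string uniValue)
        {
            return int.Parse(uniValue.Substring(2), NumberStyles.HexNumber);
        }

        // For a supplementary code point, char.ConvertFromUtf32 returns exactly
        // { high surrogate, low surrogate }, so the helpers just index into it.
        public static string GetHighSurrogate(string uniValue)
        {
            return char.ConvertFromUtf32(ParseCodePoint(uniValue))[0].ToString();
        }

        public static string GetLowSurrogate(string uniValue)
        {
            return char.ConvertFromUtf32(ParseCodePoint(uniValue))[1].ToString();
        }
    }

With that, GetHighSurrogate("U+2A601") returns "\uD869", GetLowSurrogate("U+2A601") returns "\uDE01", and their concatenation displays as 𪘁 (for a four-digit BMP input, indexing [1] would of course fail).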


Lost User (Richard MacCutchan) wrote (#8):

Very interesting, and I stand corrected. But what is the answer to the OP's question?

Just say 'NO' to evaluated arguments for dyadic functions! Ash


jschell wrote (#9):

gordon_matt wrote:

It seems nobody knows how to get this and it is driving me nuts

Mentioning the surrogate pair probably confused the issue. It certainly did for me. The following code should be what you want (with appropriate bit manipulation). If not, then could you provide the unicode.org page reference for 2A601? (I couldn't find it.)

    // 2A601
    byte[] array = new byte[4];
    int i = 0;
    array[i++] = 0x01; // little endian
    array[i++] = 0xA6;
    array[i++] = 0x02;
    array[i++] = 0x00;
    string r = System.Text.Encoding.UTF32.GetString(array);
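The same idea without hand-assembling the byte array, as a sketch: BitConverter.GetBytes follows the machine's byte order, so this assumes a little-endian host, which matches what Encoding.UTF32 (little-endian by default) expects.

    using System;
    using System.Text;

    class Utf32Demo
    {
        static void Main()
        {
            int codePoint = 0x2A601;

            // On a little-endian machine this yields { 0x01, 0xA6, 0x02, 0x00 }.
            byte[] bytes = BitConverter.GetBytes(codePoint);

            string r = Encoding.UTF32.GetString(bytes);
            Console.WriteLine(r); // 𪘁
        }
    }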


vnmatt wrote (#10):

Thanks jschell. What Richard mentioned about it being 20 bits got me thinking this must be a UTF-32 character, and after a LOT of research I finally got the answer I needed. The solution is actually quite simple once you know what to look for. If you pass "U+2A601" to this method of mine, you'll get the correct character returned. I think I may go ahead and post a tip/trick on this :-) Thanks to all for helping point me in the right direction.

    public static string ConvertUnicodeToCharacter(string unicodeValue)
    {
        int unicode = int.Parse(unicodeValue.Substring(2), NumberStyles.HexNumber);

        if (unicodeValue.Length == 7) //UTF-32
        {
            return char.ConvertFromUtf32(unicode);
        }

        return ((char)unicode).ToString(); //UTF-16
    }

Cheers
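A quick usage check of the method above; note that the Length == 7 test is a heuristic that assumes the input is always "U+" followed by exactly four or five hex digits.

    using System;
    using System.Globalization;

    class Program
    {
        // vnmatt's method from the post above, unchanged.
        static string ConvertUnicodeToCharacter(string unicodeValue)
        {
            int unicode = int.Parse(unicodeValue.Substring(2), NumberStyles.HexNumber);

            if (unicodeValue.Length == 7) // five hex digits: supplementary plane
            {
                return char.ConvertFromUtf32(unicode);
            }

            return ((char)unicode).ToString(); // four hex digits: BMP
        }

        static void Main()
        {
            Console.WriteLine(ConvertUnicodeToCharacter("U+3400"));  // 㐀
            Console.WriteLine(ConvertUnicodeToCharacter("U+2A601")); // 𪘁
        }
    }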


Lost User (Richard MacCutchan) wrote (#11):

I just came across this article which may also contain some useful information.

Just say 'NO' to evaluated arguments for dyadic functions! Ash


jschell wrote (#12):

I suspect the following would also work.

    int unicode = int.Parse(unicodeValue.Substring(2), NumberStyles.HexNumber);
    return char.ConvertFromUtf32(unicode);
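It does: char.ConvertFromUtf32 returns a one-char string for BMP scalar values and a two-char surrogate pair for U+10000..U+10FFFF, and throws ArgumentOutOfRangeException for anything that is not a Unicode scalar value (negative, above 0x10FFFF, or in the surrogate range U+D800..U+DFFF). A small check:

    using System;

    class ConvertFromUtf32Demo
    {
        static void Main()
        {
            Console.WriteLine(char.ConvertFromUtf32(0x3400));         // 㐀 (one char)
            Console.WriteLine(char.ConvertFromUtf32(0x2A601).Length); // 2 (surrogate pair)

            try
            {
                char.ConvertFromUtf32(0xD800); // a lone surrogate is not a scalar value
            }
            catch (ArgumentOutOfRangeException)
            {
                Console.WriteLine("0xD800 rejected, as expected");
            }
        }
    }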


Liao Jichen wrote (#13):

You can parse the "U+XXXXX" string into an integer code point, which in your case is unicode = 0x0002A601. Then you can compute the surrogate pair as two integers, hi and lo:

    int unicode, hi, lo;
    unicode = 0x0002A601;
    hi = (unicode - 0x10000) / 0x400 + 0xD800;
    lo = (unicode - 0x10000) % 0x400 + 0xDC00;

    string s = new String(new char[] { Convert.ToChar(hi), Convert.ToChar(lo) });

The string s is 𪘁. BTW, when dealing with Unicode you may want to use System.Globalization.StringInfo.GetTextElementEnumerator to get a TextElement; check MSDN for more details.
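A brief sketch of that last suggestion: StringInfo.GetTextElementEnumerator walks a string by text element rather than by char, so a surrogate pair comes back as a single element.

    using System;
    using System.Globalization;

    class TextElementDemo
    {
        static void Main()
        {
            string s = "a\U0002A601b"; // 'a', U+2A601 (two chars), 'b': Length is 4

            TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
            while (e.MoveNext())
            {
                // Prints three elements: "a", then the whole pair 𪘁, then "b".
                Console.WriteLine(e.GetTextElement());
            }
        }
    }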
