Converting a unicode value (with a surrogate pair) to a character in C#
-
Here you can see the character: http://www.cojak.org/index.php?function=code_lookup&term=2A601[^] I'm not sure what you're on about with the 20-bits thing, but if you have a way for me to turn U+2A601 into that character in C#, that would be great. That's all I am after. :-)
gordon_matt wrote:
I'm not sure what you're on about with the 20-bits thing
Hex value 2A601 is five digits, which is 20 bits; Unicode characters are only 16 bits. I'm not sure what translation table is being used in the link you posted, but I would suggest an email to the person who made the website as the best step forward.
Just say 'NO' to evaluated arguments for dyadic functions! Ash
-
Richard MacCutchan wrote:
Hex value 2A601 is five digits, which is 20 bits; Unicode characters are only 16 bits
In what context? Unicode as a standard supports any number of character sets; the bit count has nothing to do with that. Each set is demarcated by a minimum representational unit called a code unit, which can be 8 bits, 16 bits or 32 bits. The full range of characters in each set (which is not necessarily all possible Unicode characters) is represented by one or more code units. The OP is asking about surrogate pairs in a 16-bit set. A suitable description of that is found here; one can find it at unicode.org as well, but it requires digging. http://en.wikipedia.org/wiki/UTF-16/UCS-2[^] And C# certainly supports surrogate pairs. The following references the upper limit of 10FFFF, which would be 21 bits, and 2A601 is certainly in that range. http://msdn.microsoft.com/en-us/library/aa664669(v=vs.71).aspx[^] Perhaps you are stating that 2A601 is not a valid character in the character set?
-
Hi Guys, I want to know how to get the correct character from a Unicode value which has a surrogate pair. Here is my code, which works with most characters:

public static char ConvertUnicodeToCharacter(string unicodeValue)
{
    int unicode = int.Parse(unicodeValue.Substring(2), NumberStyles.HexNumber);
    return (char)unicode;
}

This will take in something like U+3400 and spit out the character (in this case: 㐀). When I try this one, U+2A601, I always get this: ꘁ, but it should be this: 𪘁. Info on surrogate pairs: "UTF-16 Encoding Form. The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair" - http://www.unicode.org/glossary/ I just want to convert U+2A601 to 𪘁. Anyone know how? Thanks
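To see why the cast goes wrong (a quick sketch added for illustration, not part of the original question): a C# char is only 16 bits, so casting 0x2A601 to char silently keeps just the low 16 bits, and 0xA601 happens to be the ꘁ character reported above.

// Sketch: the (char) cast truncates a code point above U+FFFF.
int codePoint = 0x2A601;
char truncated = (char)codePoint;                   // keeps only the low 16 bits
Console.WriteLine(((int)truncated).ToString("X"));  // prints A601, i.e. ꘁ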
gordon_matt wrote:
I want to know how to get the correct character from a Unicode value which has a surrogate pair.
The data type 'char' only has 16 bits. For C#, the full range of Unicode characters extends to 10FFFF, which will not fit in a 'char'. A char represents a single UTF-16 code unit, not a character from the full Unicode range. A surrogate pair is by definition two code units, so to represent it you must have two 'char' values. Alternatives would be to use an 'int' or a String.
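As a small illustration of the point about two 'char' values (my own sketch, not part of the original reply), a character above U+FFFF shows up in a .NET string as a surrogate pair:

// Sketch: a supplementary character occupies two char values in a string.
string s = "\U0002A601";                         // 𪘁, using the 8-digit escape
Console.WriteLine(s.Length);                     // 2 (two UTF-16 code units)
Console.WriteLine(char.IsHighSurrogate(s[0]));   // True
Console.WriteLine(char.IsLowSurrogate(s[1]));    // True
Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 2A601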
-
Thanks for the reply. I understand this well enough, but maybe everyone here is not quite understanding what I want... Okay, don't return a System.Char... give me a System.String... whatever. I just want to do something like this:

public string GetUnicodeCharacter(string uniValue)
{
    //uniValue = "U+2A601"
    //Do some logic
    return myUnicodeCharacter; //string that will display 𪘁
}

It seems nobody knows how to get this and it is driving me nuts. Yes, I did email the author of that site (cojak.org), but no reply yet; maybe because of holidays, but I'll be lucky if he helps me anyway (not to mention his site is PHP, so he might not be able to help with .NET). Maybe there is a way to get the surrogate values from the "U+2A601" string? Then I could do something (ugly as it may be) like this:
public string myImaginaryMethod()
{
string original = "\u2A601";
string highSurrogate = GetHighSurrogate(original); // returns "\uD869";
string lowSurrogate = GetLowSurrogate(original); // returns "\uDE01";
return highSurrogate + lowSurrogate; //returns 𪘁
}
Surely someone must know how to get this value somehow... anyhow... I don't care how, so long as I can get 𪘁 from "U+2A601" (or from "\u2A601" would be fine too) :-)
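For what it's worth, the two imaginary helpers above could be written with the standard UTF-16 surrogate arithmetic. This is only a sketch under that assumption; GetHighSurrogate and GetLowSurrogate are the hypothetical names from the snippet, here taking the numeric code point rather than a string for simplicity:

// Hypothetical helpers (not framework methods): standard surrogate arithmetic
// for a code point in the range U+10000..U+10FFFF.
static char GetHighSurrogate(int codePoint)
{
    return (char)(((codePoint - 0x10000) >> 10) + 0xD800);   // 0x2A601 -> 0xD869
}

static char GetLowSurrogate(int codePoint)
{
    return (char)(((codePoint - 0x10000) & 0x3FF) + 0xDC00); // 0x2A601 -> 0xDE01
}

// new string(new[] { GetHighSurrogate(0x2A601), GetLowSurrogate(0x2A601) }) == "𪘁"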
-
gordon_matt wrote:
It seems nobody knows how to get this and it is driving me nuts
Mentioning the surrogate pair probably confused the issue. It certainly did for me. The following code should be what you want (with appropriate bit manipulation). If not, could you provide the unicode.org page reference for 2A601 (I couldn't find it)?

// 2A601
byte[] array = new byte[4];
int i = 0;
array[i++] = (byte)0x01; // little-endian: least significant byte first
array[i++] = (byte)0xA6;
array[i++] = (byte)0x02;
array[i++] = (byte)0x00;
String r = System.Text.Encoding.UTF32.GetString(array);
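The same idea can be expressed without hand-assembling the byte array. A rough sketch, assuming the usual little-endian platform so that BitConverter's output matches Encoding.UTF32's default little-endian form (FromCodePoint is just a name for the sketch):

// Sketch: encode any code point as UTF-32 little-endian bytes and decode it.
// Assumes BitConverter.IsLittleEndian is true, as on typical .NET platforms.
static string FromCodePoint(int codePoint)
{
    byte[] utf32Bytes = BitConverter.GetBytes(codePoint);    // 4 bytes, LSB first
    return System.Text.Encoding.UTF32.GetString(utf32Bytes);
}

// FromCodePoint(0x2A601) returns "𪘁"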
-
Thanks jschell. What Richard mentioned about it being 20 bits got me thinking this must be a UTF-32 character and after a LOT of research, I finally got the answer I needed and the solution is actually quite simple once you know what to look for: If you pass in "U+2A601" to this method of mine, you'll get the correct character returned. I think I may go ahead and post a tip/trick on this :-) Thanks to all for helping point me in the right direction.
public static string ConvertUnicodeToCharacter(string unicodeValue)
{
    int unicode = int.Parse(unicodeValue.Substring(2), NumberStyles.HexNumber);

    if (unicodeValue.Length == 7) //UTF-32
    {
        return char.ConvertFromUtf32(unicode);
    }

    return ((char)unicode).ToString(); //UTF-16
}
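One caveat worth noting as an aside: the Length == 7 test only recognises values written with exactly five hex digits. Since char.ConvertFromUtf32 handles the whole scalar range U+0000..U+10FFFF (excluding surrogate code points), a variant along these lines would avoid the special case. This is a sketch, assuming the same "U+XXXX"-style input and a using System.Globalization directive for NumberStyles:

// Sketch: char.ConvertFromUtf32 works for BMP and supplementary code points alike,
// so the length check can be dropped.
public static string ConvertUnicodeToCharacter(string unicodeValue)
{
    int unicode = int.Parse(unicodeValue.Substring(2), NumberStyles.HexNumber);
    return char.ConvertFromUtf32(unicode);
}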
Cheers
-
You can parse the "U+XXXXX" string into an integer code point, which in your case should be unicode = 0x0002A601. Then you can compute the surrogate pair as two integers, hi and lo:
int unicode, hi, lo;
unicode=0x0002A601;
hi = (unicode - 0x10000) / 0x400 + 0xD800;
lo = (unicode - 0x10000) % 0x400 + 0xDC00;
string s = new String(new char[] { Convert.ToChar(hi), Convert.ToChar(lo) });
The string s is 𪘁. BTW, when dealing with Unicode you may want to use System.Globalization.StringInfo.GetTextElementEnumerator to get a TextElement; check MSDN for more details.
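As a rough illustration of that last suggestion (a sketch, not from the original reply), the text-element enumerator returns a surrogate pair as a single element rather than two separate chars:

// Sketch: enumerate text elements, so "𪘁" comes back as one element.
using System.Globalization;

string text = "a𪘁b";
TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
while (elements.MoveNext())
{
    string element = (string)elements.Current;           // "a", then "𪘁", then "b"
    Console.WriteLine("{0} ({1} char value(s))", element, element.Length);
}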