Converting a unicode value (with a surrogate pair) to a character in C#
-
Here you can see the character: http://www.cojak.org/index.php?function=code_lookup&term=2A601[^] I'm not sure what you're on about with the 20-bits thing, but if you have a way for me to turn U+2A601 into that character in C#, that would be great. That's all I am after. :-)
gordon_matt wrote:
I'm not sure what you're on about with the 20-bits thing
Hex value 2A601 is five digits, which is 20 bits; Unicode characters are only 16 bits. I'm not sure what translation table is being used in the link you posted, but I would suggest an email to the person who made the website as the best step forward.
Just say 'NO' to evaluated arguments for dyadic functions! Ash
-
Richard MacCutchan wrote:
Hex value 2A601 is five digits, which is 20 bits; Unicode characters are only 16 bits
In what context? Unicode as a standard supports any number of character sets; the bit count has nothing to do with that. Each set is demarcated by a minimum representational unit called a code unit, which can be 8 bits, 16 bits or 32 bits. The full range of characters in each set (which is not necessarily all possible Unicode characters) is represented by one or more code units. The OP is asking about surrogate pairs in a 16-bit set. A suitable description of that is found here; one can find it at unicode.org as well, but it requires digging. http://en.wikipedia.org/wiki/UTF-16/UCS-2[^] And C# certainly supports surrogate pairs. The following references the upper limit of 10FFFF, which would be 21 bits, and 2A601 is certainly in that range. http://msdn.microsoft.com/en-us/library/aa664669(v=vs.71).aspx[^] Perhaps you are stating that 2A601 is not a valid character in the character set?
-
Hi Guys, I want to know how to get the correct character from a Unicode value which has a surrogate pair. Here is my code, which works with most characters:

public static char ConvertUnicodeToCharacter(string unicodeValue)
{
    int unicode = int.Parse(unicodeValue.Substring(2), NumberStyles.HexNumber);
    return (char)unicode;
}

This will take in something like U+3400 and spit out the character (in this case: 㐀). When I try this one, U+2A601, I always get this: ꘁ, but it should be this: 𪘁. Info on surrogate pairs: "UTF-16 Encoding Form. The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair" - http://www.unicode.org/glossary/ I just want to convert U+2A601 to 𪘁. Anyone know how? Thanks
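To see why the cast goes wrong (a quick sketch added for illustration, not part of the original question): a C# char is only 16 bits, so casting 0x2A601 to char silently keeps just the low 16 bits, and 0xA601 happens to be the ꘁ character reported above.

// Sketch: the (char) cast truncates a code point above U+FFFF.
int codePoint = 0x2A601;
char truncated = (char)codePoint;                   // keeps only the low 16 bits
Console.WriteLine(((int)truncated).ToString("X"));  // prints A601, i.e. ꘁ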
gordon_matt wrote:
I want to know how to get the correct character from a Unicode value which has a surrogate pair.
The data type 'char' only has 16 bits. For C#, the full range of Unicode characters extends to 10FFFF, which will not fit in a 'char'. A char represents a single UTF-16 code unit, not a character from the full Unicode range. A surrogate pair is by definition two code units, so to represent it you must have two 'char' values. Alternatives would be to use an 'int' or a String.
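As a small illustration of the point about two 'char' values (my own sketch, not part of the original reply), a character above U+FFFF shows up in a .NET string as a surrogate pair:

// Sketch: a supplementary character occupies two char values in a string.
string s = "\U0002A601";                         // 𪘁, using the 8-digit escape
Console.WriteLine(s.Length);                     // 2 (two UTF-16 code units)
Console.WriteLine(char.IsHighSurrogate(s[0]));   // True
Console.WriteLine(char.IsLowSurrogate(s[1]));    // True
Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 2A601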
-
Thanks for the reply. I understand this well enough, but maybe everyone here is not quite understanding what I want... Okay, don't return a System.Char... give me a System.String... whatever. I just want to do something like this:

public string GetUnicodeCharacter(string uniValue)
{
    //uniValue = "U+2A601"
    //Do some logic
    return myUnicodeCharacter; //string that will display 𪘁
}

It seems nobody knows how to get this and it is driving me nuts. Yes, I did email the author of that site (cojak.org), but no reply yet; maybe because of holidays, but I'll be lucky if he helps me anyway (not to mention his site is PHP, so he might not be able to help with .NET). Maybe there is a way to get the surrogate values from the "U+2A601" string? Then I could do something (ugly as it may be) like this:
public string myImaginaryMethod()
{
string original = "\u2A601";
string highSurrogate = GetHighSurrogate(original); // returns "\uD869";
string lowSurrogate = GetLowSurrogate(original); // returns "\uDE01";
return highSurrogate + lowSurrogate; //returns 𪘁
}
Surely someone must know how to get this value somehow... anyhow... I don't care how, so long as I can get 𪘁 from "U+2A601" (or from "\u2A601" would be fine too) :-)
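For what it's worth, the two imaginary helpers above could be written with the standard UTF-16 surrogate arithmetic. This is only a sketch under that assumption; GetHighSurrogate and GetLowSurrogate are the hypothetical names from the snippet, here taking the numeric code point rather than a string for simplicity:

// Hypothetical helpers (not framework methods): standard surrogate arithmetic
// for a code point in the range U+10000..U+10FFFF.
static char GetHighSurrogate(int codePoint)
{
    return (char)(((codePoint - 0x10000) >> 10) + 0xD800);   // 0x2A601 -> 0xD869
}

static char GetLowSurrogate(int codePoint)
{
    return (char)(((codePoint - 0x10000) & 0x3FF) + 0xDC00); // 0x2A601 -> 0xDE01
}

// new string(new[] { GetHighSurrogate(0x2A601), GetLowSurrogate(0x2A601) }) == "𪘁"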
-
gordon_matt wrote:
It seems nobody knows how to get this and it is driving me nuts
Mentioning the surrogate pair probably confused the issue. It certainly did for me. The following code should be what you want (with appropriate bit manipulation). If not, could you provide the unicode.org page reference for 2A601 (I couldn't find it)?

// 2A601
byte[] array = new byte[4];
int i = 0;
array[i++] = (byte)0x01; // little-endian: least significant byte first
array[i++] = (byte)0xA6;
array[i++] = (byte)0x02;
array[i++] = (byte)0x00;
String r = System.Text.Encoding.UTF32.GetString(array);
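The same idea can be expressed without hand-assembling the byte array. A rough sketch, assuming the usual little-endian platform so that BitConverter's output matches Encoding.UTF32's default little-endian form (FromCodePoint is just a name for the sketch):

// Sketch: encode any code point as UTF-32 little-endian bytes and decode it.
// Assumes BitConverter.IsLittleEndian is true, as on typical .NET platforms.
static string FromCodePoint(int codePoint)
{
    byte[] utf32Bytes = BitConverter.GetBytes(codePoint);    // 4 bytes, LSB first
    return System.Text.Encoding.UTF32.GetString(utf32Bytes);
}

// FromCodePoint(0x2A601) returns "𪘁"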
-
Thanks jschell. What Richard mentioned about it being 20 bits got me thinking this must be a UTF-32 character and after a LOT of research, I finally got the answer I needed and the solution is actually quite simple once you know what to look for: If you pass in "U+2A601" to this method of mine, you'll get the correct character returned. I think I may go ahead and post a tip/trick on this :-) Thanks to all for helping point me in the right direction.
public static string ConvertUnicodeToCharacter(string unicodeValue)
{
    int unicode = int.Parse(unicodeValue.Substring(2), NumberStyles.HexNumber);

    if (unicodeValue.Length == 7) //UTF-32
    {
        return char.ConvertFromUtf32(unicode);
    }

    return ((char)unicode).ToString(); //UTF-16
}
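One caveat worth noting as an aside: the Length == 7 test only recognises values written with exactly five hex digits. Since char.ConvertFromUtf32 handles the whole scalar range U+0000..U+10FFFF (excluding surrogate code points), a variant along these lines would avoid the special case. This is a sketch, assuming the same "U+XXXX"-style input and a using System.Globalization directive for NumberStyles:

// Sketch: char.ConvertFromUtf32 works for BMP and supplementary code points alike,
// so the length check can be dropped.
public static string ConvertUnicodeToCharacter(string unicodeValue)
{
    int unicode = int.Parse(unicodeValue.Substring(2), NumberStyles.HexNumber);
    return char.ConvertFromUtf32(unicode);
}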
Cheers
-
You can parse the "U+XXXXX" string into an integer code point, which in your case should be unicode = 0x0002A601. Then you can compute the surrogate pair as two integers, hi and lo:
int unicode, hi, lo;
unicode=0x0002A601;
hi = (unicode - 0x10000) / 0x400 + 0xD800;
lo = (unicode - 0x10000) % 0x400 + 0xDC00;
string s = new String(new char[] { Convert.ToChar(hi), Convert.ToChar(lo) });
The string s is 𪘁. BTW, when dealing with Unicode you may want to use System.Globalization.StringInfo.GetTextElementEnumerator to get a TextElement; check MSDN for more details.
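As a rough illustration of that last suggestion (a sketch, not from the original reply), the text-element enumerator returns a surrogate pair as a single element rather than two separate chars:

// Sketch: enumerate text elements, so "𪘁" comes back as one element.
using System.Globalization;

string text = "a𪘁b";
TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
while (elements.MoveNext())
{
    string element = (string)elements.Current;           // "a", then "𪘁", then "b"
    Console.WriteLine("{0} ({1} char value(s))", element, element.Length);
}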