UTF8 Encoding - need help or explanation
-
Hi everyone, I'm having a strange problem with Unicode encoding in C# / Macromedia Flash, and I think I need a little explanation to make sure I understand WHERE the problem is. :) So, in C#, we have Encoding.UTF8 and Encoding.Unicode. As I understand it, Encoding.UTF8 will encode ASCII characters into 8 bits, and all other characters as 16 bits (accented characters, etc). On the other hand, Encoding.Unicode is actually UTF-16 and will encode all characters into 16 bits.

The problem: Latin small letter s with caron - š - with character code 0x0161. This letter is encoded into 0x6101 when using Encoding.Unicode, or 0x0161 when using Encoding.BigEndianUnicode. However, when using Encoding.UTF8, this letter is encoded into 0xC5A1.

In Macromedia Flash, strings are apparently encoded using UTF-8, as the base ASCII characters are encoded into 8 bits, but the small letter s with caron - š - is encoded into 0x0161. So now I don't know why it is different in UTF-8 in C#. Any clues will be highly appreciated... Rado
Radoslav Bielik http://www.neomyz.com/poll - Get your own web poll
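A note on the "8 bits or 16 bits" description of UTF-8 in the question: UTF-8 is a variable-length encoding that uses one to four bytes per code point, while UTF-16 uses one or two 16-bit units. The following is only a minimal sketch in Python (chosen here for brevity, the thread itself is about C# and Flash) that makes the lengths visible; the helper name `encoded_lengths` and the sample characters are illustrative choices, not something from the original post.

```python
def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    # "utf-16-le" is used so that no byte order mark is added to the output.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# Hypothetical sample set: ASCII, a Latin letter with caron,
# the euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ["s", "\u0161", "\u20ac", "\U0001F600"]

for ch in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")
```

So š (U+0161) takes two bytes in both encodings, but the two bytes are different, which is consistent with the values quoted in the question.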
-
Yes, LATIN SMALL LETTER S WITH CARON is U+0161. In little-endian UTF-16 this is the byte sequence 0x61 0x01, in big-endian UTF-16 it is 0x01 0x61, and in UTF-8 it is 0xC5 0xA1. I'd look into how Flash encodes characters. See for example http://www.macromedia.com/support/flash/languages/unicode_in_flmx/.
Stability. What an interesting concept. -- Chris Maunder
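The three byte sequences in this reply can be checked with a few lines of script. This is a small sketch in Python rather than C#, and `hex_bytes` is a hypothetical helper written just for this illustration.

```python
def hex_bytes(text, encoding):
    """Encode text and return its bytes as a space-separated hex string."""
    return " ".join(f"0x{b:02X}" for b in text.encode(encoding))


s_caron = "\u0161"  # LATIN SMALL LETTER S WITH CARON

print(hex_bytes(s_caron, "utf-16-le"))  # 0x61 0x01
print(hex_bytes(s_caron, "utf-16-be"))  # 0x01 0x61
print(hex_bytes(s_caron, "utf-8"))      # 0xC5 0xA1
```

The little-endian result matches what the question reports for Encoding.Unicode, and the big-endian result matches Encoding.BigEndianUnicode.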
-
Thanks Mike, this makes sense! :) Now it seems that Flash is actually using UTF-16 internally, not UTF-8. I will forward this to our Flash guy. One more question - is there an easy and straightforward way to convert the UTF-16 representation of a letter to its UTF-8 representation? [EDIT] I was thinking about an algorithm, or a simple script, not about the C# Encoding.Convert method. [/EDIT] Thanks again! Rado
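On the conversion question, one possible answer is to do the bit manipulation by hand: combine any surrogate pair into a single code point, then emit one to four UTF-8 bytes depending on the size of that code point. Below is a rough sketch in Python, assuming the input is already a list of 16-bit code units as integers (if you only have raw bytes, pair them up according to the byte order first). The function name `utf16_units_to_utf8` is made up for this example.

```python
def utf16_units_to_utf8(units):
    """Convert a sequence of UTF-16 code units (ints) to UTF-8 bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        cp = units[i]
        i += 1

        # Step 1: turn a surrogate pair into one code point above U+FFFF.
        if 0xD800 <= cp <= 0xDBFF:
            if i >= len(units) or not (0xDC00 <= units[i] <= 0xDFFF):
                raise ValueError("unpaired high surrogate")
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        elif 0xDC00 <= cp <= 0xDFFF:
            raise ValueError("unpaired low surrogate")

        # Step 2: emit 1 to 4 bytes depending on the code point range.
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)


# U+0161 falls in the two-byte range: 0xC0 | (0x161 >> 6) = 0xC5,
# and 0x80 | (0x161 & 0x3F) = 0xA1.
print(utf16_units_to_utf8([0x0161]).hex())  # c5a1
```

The same logic should port to ActionScript or any other language with integer bit operations, since it only uses shifts, masks and comparisons.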