Was it the right decision to select UTF-16 as the native encoding for NT and CE?

Tags: c++, xml, json, performance, tutorial
Vagif Abilov wrote (#1):

As we know, Unicode is the native encoding for Windows NT/2000 and Windows CE (on WinCE it is the only encoding supported by the OS API functions). The word "Unicode" does not actually refer to a single encoding, though. There are several encodings that support multiple languages, and the one Microsoft uses is UTF-16, in which characters take 2 bytes. One of the alternatives is UTF-8, a multibyte encoding that takes 1 byte for ASCII characters and 2-3 bytes for others.

When I first started working with Unicode, I thought the selection of UTF-16 was smart: a developer just has to remember to use 2 bytes per character, and that's it. However, the fact that an application operating on ASCII text simply wastes half of the memory allocated to its strings doesn't sound good. Moreover, storing text data in databases will result either in double space allocation or, if the data are converted to UTF-8 or ASCII, in conversion routines being needed both ways.

Let's imagine that Microsoft had selected UTF-8 as the native encoding for their platforms. Pros: efficient memory allocation; conversion routines are only required between UTF-8 and non-Latin texts, so in the USA most applications would be Unicode-compliant by definition; files with Latin-only text would be plain ASCII files. Cons: computing string length becomes more complicated - you can't just search for the terminating zero.

In XML, for example, it is UTF-8 that is the default encoding, and this is why we usually have no problem just opening an XML document in any text editor. I wonder whether Microsoft would choose UTF-8 as the default for their OS if they had the choice now.

Vagif
Win32/ATL/MFC Developer
Oslo, Norway
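
To make the memory argument concrete, here is a minimal Win32 C++ sketch (mine, not from the post; the sample string is an arbitrary assumption) that measures how many bytes the same ASCII text occupies in UTF-16 versus UTF-8, using WideCharToMultiByte with CP_UTF8:

    #include <windows.h>
    #include <wchar.h>
    #include <stdio.h>

    int main()
    {
        // Pure ASCII text: every UTF-16 code unit carries a zero high byte.
        const WCHAR* ascii = L"Hello, world";

        // UTF-16 size: 2 bytes per code unit.
        int utf16Bytes = (int)(wcslen(ascii) * sizeof(WCHAR));

        // UTF-8 size: ask WideCharToMultiByte how large the conversion would
        // be (passing -1 counts the terminating zero, so subtract it back).
        int utf8Bytes = WideCharToMultiByte(CP_UTF8, 0, ascii, -1,
                                            NULL, 0, NULL, NULL) - 1;

        printf("UTF-16: %d bytes, UTF-8: %d bytes\n", utf16Bytes, utf8Bytes);
        return 0;
    }

For this string the sketch should print 24 bytes for UTF-16 against 12 for UTF-8, which is the "wasted half" the post describes.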

James Pullicino wrote (#2), in reply to #1:

I think that Microsoft chose UTF-16 because it was more accepted by developers (by developers I mean C++ programmers, of course). Think about it: imagine you've been programming with 1-byte chars for 3+ years, and all of a sudden MS tells you that in order to program for NT you need to start coding strings in a different manner. What Unicode format would you, as a developer, have chosen? L"(2b || !2b)" ;P

Vagif Abilov wrote (#3), in reply to #2:

But developers who learned that a character is just 1 byte would have to change their habits anyway! Yes, use of MBCS complicates string iteration, but learning this is no more difficult than writing _T("abc") instead of "abc". Answering your question: 5 years ago I would have chosen UTF-16. Now, working mostly with WinCE (and Win32, of course), I would choose UTF-8.

Vagif
Win32/ATL/MFC Developer
Oslo, Norway
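
For reference, the _T idiom mentioned above is the standard tchar.h pattern; a quick sketch (the literal is just an example) showing how the same source compiles to ANSI or wide strings depending on whether _UNICODE is defined:

    #include <tchar.h>
    #include <stdio.h>

    int main()
    {
        // _T("abc") expands to "abc" in an ANSI/MBCS build and to L"abc"
        // in a Unicode build; TCHAR is char or wchar_t accordingly.
        const TCHAR* s = _T("abc");

        // _tcslen maps to strlen or wcslen to match.
        size_t n = _tcslen(s);

        _tprintf(_T("%u characters\n"), (unsigned)n);
        return 0;
    }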

James Pullicino wrote (#4), in reply to #3:

Transition is important. Maybe in 5 years' time Windows YQ will use UTF-8... maybe one day, who knows. (2b || !2b)

Eric Kenslow wrote (#5), in reply to #3:

If only it were as easy as remembering to wrap your constants in the _T macro. Actually, you need to call API functions to do _anything_ with a multibyte string, including simple iteration through the characters. This leads to a difficult-to-test mess that is a pain for developers, while providing no real benefit to users (users don't care whether you use UTF-8 or UTF-16 as long as you work in their language).

Try to keep some perspective. Unicode is easier to write and test code for because it works consistently across (human) languages. Since pretty much all commercial software these days is going to be localized in one way or another, that's a big concern. Going to UTF-8 would be a huge step backward: the _minuscule_ amount of memory it saves is completely outweighed by its disadvantages.

I just wish all the MS platforms supported Unicode, so I didn't have to switch to multibyte when compiling apps for 9x. ;P

-- Eric
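
To illustrate the iteration point: with an MBCS string you cannot simply increment a char pointer, because a character may occupy one or two bytes. A hedged sketch (one way to do it; _mbsinc would work as well) using the Win32 CharNextA function:

    #include <windows.h>

    // Counts characters (not bytes) in a multibyte string. A plain s++
    // would split double-byte characters; CharNextA steps over a lead/trail
    // byte pair as one unit, according to the current ANSI code page.
    int CountMbcsChars(const char* s)
    {
        int count = 0;
        while (*s != '\0')
        {
            s = CharNextA(s);
            ++count;
        }
        return count;
    }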

James Pullicino wrote (#6), in reply to #5:

              I agree. (2b || !2b)

Vagif Abilov wrote (#7), in reply to #5:

Well, you definitely have a point when you mention the inconvenience of calling the API to do anything with a multibyte string. Still, I think the simplicity of UTF-16 could be combined with the compactness of UTF-8 for data storage. Just as they made a rule of using wide strings in COM interfaces, they could, in addition to the Axxx and Wxxx functions, provide some kind of Uxxx (for UTF-8) I/O functions that would convert wide strings on the fly before storing them. Of course, it's no problem to write those yourself, but making them available from MS would reduce the number of ASCII text files stored in wide format. Just my 2 kopecks.

Vagif
Win32/ATL/MFC Developer
Oslo, Norway
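
The Uxxx functions proposed here do not exist in the Win32 API; as a sketch of what one might look like, here is a hypothetical WriteFileU built on WideCharToMultiByte with CP_UTF8, converting the wide string on the fly before it is written out:

    #include <windows.h>

    // Hypothetical "Uxxx"-style helper: writes a wide string to a file
    // as UTF-8.
    BOOL WriteFileU(HANDLE hFile, const WCHAR* text)
    {
        // First call computes the UTF-8 size, terminating zero included.
        int cb = WideCharToMultiByte(CP_UTF8, 0, text, -1,
                                     NULL, 0, NULL, NULL);
        if (cb <= 1)
            return FALSE;

        char* utf8 = (char*)LocalAlloc(LMEM_FIXED, cb);
        if (utf8 == NULL)
            return FALSE;

        // Second call performs the actual conversion.
        WideCharToMultiByte(CP_UTF8, 0, text, -1, utf8, cb, NULL, NULL);

        DWORD written;
        BOOL ok = WriteFile(hFile, utf8, cb - 1, &written, NULL); // drop '\0'
        LocalFree(utf8);
        return ok;
    }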

Vagif Abilov wrote (#8), in reply to #5:

Another thing that I forgot to mention in my previous reply is that Unicode in general also uses variable-length characters! (Most developers simply don't know this.) Here is a quote from MSDN:

"There is a need to support more characters than the 65,536 that fit in the 16-bit Unicode code space. For example, the Chinese-speaking community alone uses over 55,000 characters. To answer this need, the Unicode Standard defines surrogates. A surrogate or surrogate pair is a pair of 16-bit Unicode code values that represent a single character. The first (high) surrogate is a 16-bit code value in the range U+D800 to U+DBFF. The second (low) surrogate is a 16-bit code value in the range U+DC00 to U+DFFF. Using surrogates, Unicode can support over one million characters. For more details about surrogates, refer to The Unicode Standard, version 2.0. Windows 2000 and Whistler provide support for basic input, output, and simple sorting of surrogates. However, not all system components are surrogate compatible. Also, surrogates are not supported in Windows 95/98/Me or in Windows NT 4.0."

And here is what the C# docs say: "Since C# uses a 16-bit encoding of Unicode characters in characters and strings, a Unicode character in the range U+10000 to U+10FFFF is represented using two Unicode “surrogate” characters."

In fact, only a 4-byte representation (UCS-4) can guarantee fixed-length characters (on Earth, at least ;)). This is why I think it could be smart to have two character layers: one representing characters in volatile memory as UCS-4 32-bit integers, so we could treat them as array elements, and another one for storage purposes that would compact everything into UTF-8.

Vagif
Win32/ATL/MFC Developer
Oslo, Norway
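
The surrogate arithmetic in that quote can be written down directly; a small sketch combining a high and a low surrogate into one UCS-4 code point:

    // Combines a surrogate pair into a UCS-4 code point (U+10000..U+10FFFF).
    // hi must lie in U+D800..U+DBFF and lo in U+DC00..U+DFFF; each carries
    // 10 bits of the 20-bit offset above U+10000.
    unsigned long FromSurrogatePair(unsigned short hi, unsigned short lo)
    {
        return 0x10000UL
             + (((unsigned long)(hi - 0xD800)) << 10)
             + (unsigned long)(lo - 0xDC00);
    }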

Vagif Abilov wrote (#9), in reply to #6:

Please look at one of my replies to Eric. Basically, 16-bit Unicode is _NOT_ a fixed-length format, so it only gives simple access to "traditional" languages, not to exotic symbols.

Vagif
Win32/ATL/MFC Developer
Oslo, Norway
