Microsoft CRT doesn't support Unicode properly for console applications.

Jorgen Sigvardsson wrote:

Well, it all depends on what comes after the 0x100 byte. If it's a valid UTF-16 "glyph" (not sure if character is the correct word when it comes to Unicode), then obviously there wouldn't be any problems. If it's not a valid character, well, then there's a "syntax error" and it should be dealt with accordingly. For a console driver, that should (in my opinion) mean the same thing as trying to print a non-printable ASCII character. A beep, a question mark, or no action would be appropriate. It shouldn't just roll over and die...

-- [LIVE] From Omicron Persei 8

Joe Woodbury replied (#8):

Further checking shows that Microsoft is implementing proper behavior. If there is a loss of integrity in the output stream, badbit is set and an exception may or may not be thrown (I'm not positive on the last part, but I did run across a reference to this possibility). FYI, 0x100 does represent a valid UTF-16 glyph (an upper-case A with a bar over it) but it isn't supported by all font classes nor by the console. Since it cannot be validly represented, it is a loss of integrity. (This illustrates an issue with iostreams, which consider the source/destination of data to be an abstraction. Consistent behavior for the base iostreams class therefore insists that any illegal character be treated as an error. Personally, I'd rather have it throw an exception.)

Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

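To make the badbit-and-exceptions behaviour Joe describes concrete, here is a minimal sketch (illustrative only, not code from the thread) that opts in to exceptions via the standard exceptions() mask:

    #include <cstdio>
    #include <iostream>

    int wmain(int argc, wchar_t* argv[])
    {
        using namespace std;
        // By default a failed wide-to-narrow conversion just sets
        // badbit and later output is silently discarded; opting in
        // to exceptions makes the failure visible instead.
        wcout.exceptions(wostream::badbit | wostream::failbit);
        try {
            wcout << L"\x100" << endl;
        } catch (const ios_base::failure& e) {
            fprintf(stderr, "wcout failed: %s\n", e.what());
        }
        return 0;
    }
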
Jorgen Sigvardsson replied (#9):

Joe Woodbury wrote:

FYI, 0x100 does represent a valid UTF-16 glyph (an upper-case A with a bar over it) but it isn't supported by all font classes nor by the console. Since it cannot be validly represented, it is a loss of integrity.

It does not, as a single byte. As a 16-bit word, yes. But now that I've had a second look at the code, I see the L-prefix, meaning \x100 is a 16-bit word. :-O

Joe Woodbury wrote:

Personally, I'd rather have it throw an exception.

I'd rather have the option of having it show a user-defined character when it encounters an unprintable character. Kind of like how the Win32 Wide->MultiByte string conversion functions work.

-- Now with chucklelin

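The Win32 behaviour Jorgen refers to is the lpDefaultChar parameter of WideCharToMultiByte; a sketch of that substitution (the helper name ToConsoleBytes is made up for illustration):

    #include <windows.h>
    #include <string>

    // Convert UTF-16 to the console's current output code page,
    // substituting '?' for anything the code page can't represent.
    // (A default character is not permitted for CP_UTF8, but the
    // console's OEM code pages allow it.)
    std::string ToConsoleBytes(const std::wstring& wide)
    {
        UINT cp = GetConsoleOutputCP();
        BOOL usedDefault = FALSE;
        int len = WideCharToMultiByte(cp, 0, wide.c_str(), -1,
                                      NULL, 0, "?", &usedDefault);
        if (len <= 0)
            return std::string();
        std::string out(len, '\0');
        WideCharToMultiByte(cp, 0, wide.c_str(), -1,
                            &out[0], len, "?", &usedDefault);
        out.resize(len - 1); // drop the terminating NUL counted by -1
        return out;
    }
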
David Crow wrote:

Stephen Hewitt wrote:

wcout << L"\x100";

What character is ASCII 256? If it can't be properly rendered, is cout, or the console, just refusing to continue?

"Approved Workmen Are Not Ashamed" - 2 Timothy 2:15

"Judge not by the eye but by the heart." - Native American Proverb

LittleGreenMartian replied (#10):

ASCII cannot encode a character 256, since its codes are zero-based (0-255). Character 255 is encoded in various places as either a hard space or a delete character. For reference see here: http://office.microsoft.com/en-us/assistance/HA011331361033.aspx

Jorgen Sigvardsson wrote:

It's not ASCII, it's UTF-16. 0x100 isn't a valid UTF-16 code.

-- From the network that brought you "The Simpsons"

Stephen Hewitt replied (#11):

U+0100 is indeed a valid Unicode character. See here[^] from the Unicode homepage.

Steve

Stephen Hewitt replied to LittleGreenMartian (#12):

I'm not using ASCII; notice the L prefix before the string. I'm using UNICODE strings, which are 16 bits per character (if you ignore surrogate pairs).

Steve

Joe Woodbury wrote:

The bug is in your code; 0x100 is not a valid console character.

Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

Stephen Hewitt replied (#13):

See here[^] for a map of the console characters and their Unicode codes. See here[^] to see where I got this link from. The following is taken from the character chart:

    9E = U+20A7 : PESETA SIGN

So U+20A7 is a valid console character. But if I try the following program the problem remains:

    #include <iostream>

    int wmain(int argc, wchar_t* argv[])
    {
        using namespace std;
        wcout << L"So far so good!" << endl;
        wcout << L"\x20A7";
        wcout << L"Doesn't appear on console!" << endl;
        return 0;
    }

It is a bug and it doesn't matter if the character is a valid console character or not. Microsoft admit the problem and are going to fix it.

Steve

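Incidentally, the "Doesn't appear on console!" symptom follows from badbit poisoning the stream: every subsequent insertion fails too. A sketch (illustrative only; recovery behaviour may vary between CRT versions) of clearing the state so later output gets through:

    #include <iostream>

    int wmain(int argc, wchar_t* argv[])
    {
        using namespace std;
        wcout << L"So far so good!" << endl;
        wcout << L"\x20A7";        // conversion fails, badbit is set
        if (wcout.fail()) {
            wcout.clear();         // reset the state bits
        }
        // Without the clear() above, this line would be swallowed too.
        wcout << L"Back on the console!" << endl;
        return 0;
    }
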
Jorgen Sigvardsson replied to Stephen Hewitt (#14):

Yes, it is... I didn't see the L-prefixes at first. :-O (In fact, \x100 doesn't even make sense in terms of bytes :))

-- For External Use Only

Joe Woodbury replied (#15):

(First, after studying this and writing lots of test code, I agree that Microsoft should have created a solution for this. However, I still don't think it's technically their fault.)

There are two things going on. Technically, wcout is functioning as per the standard. There is an integrity error on output and therefore badbit is set. The question then is, why isn't the character being converted properly? It's because the string is converted using wctomb_s which, by definition of the standard, uses the "C" locale. One way around this is to make a call to setlocale() before making the call to wcout. But there is no locale for the "console".

[EDIT: I read the pertinent part of the standard and it says: "Conversions between the two representations occur within the Standard C Library. The conversion rules can, in principle, be altered by a call to setlocale (page 71) that alters the category LC_CTYPE (page 68). Each wide stream determines the conversion rules at the time it becomes wide oriented, and retains these rules even if the category LC_CTYPE (page 68) subsequently changes." It seems that the console is arguably allowed to determine its conversion rules at the instant of the first call to wcout, which would be to set its locale to the equivalent of "console". However, if standard output is not the console, then the locale it should use isn't defined.]

Having said that, this is the 21st century and the CRT should understand that the string is being output to the console and operate accordingly. Microsoft should have come up with a documented solution, but I do think they were working within the standard. Ultimately, a big chunk of the blame is with the C and C++ standards committees, which have done a horrible job with locale. Not only were they very late to the party, so to speak, their solution has been rather pathetic.

(Finally, there is arguably a design flaw in iostreams. You should have the option of having an exception thrown on an error or having iostreams "make" it work--output something, anything, but keep on chugging.)

-- modified at 3:56 Monday 30th October, 2006

Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

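A minimal sketch of the setlocale() workaround Joe mentions (illustrative only; whether U+20A7 then renders still depends on the console's code page and font):

    #include <clocale>
    #include <iostream>

    int wmain(int argc, wchar_t* argv[])
    {
        using namespace std;
        // Swap the default "C" locale for the user's locale before
        // wcout becomes wide-oriented, so the wide-to-multibyte
        // conversion uses the user's code page instead of failing
        // outright on anything above 255.
        setlocale(LC_ALL, "");
        wcout << L"\x20A7" << endl;
        return 0;
    }
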
Mike Dimmick replied (#16):

The console functionality in Windows is a little weird. If you write to the console with the WriteConsole API, you can supply either a Unicode or a multibyte string, depending on whether you call WriteConsoleW or WriteConsoleA. However, if you write to the console with WriteFile, you must use byte-oriented ('ANSI') characters. It appears that WriteConsole has been given the same number and type of parameters as WriteFile to facilitate using a function pointer.

The CRT ultimately uses WriteFile exclusively to write to streams, so it has to convert to multibyte strings. If the "C" locale is selected, wctomb_s simply fails the conversion with 'invalid sequence' if any character code over 255 is used.

The actual character displayed in the console depends on the console's selected code page. See SetConsoleOutputCP. Even if you can get the data passed to the console in the right form, you may still get the wrong output because the console's code page is not Unicode. You have to have a TrueType font selected (e.g. Lucida Console) in order to see the right glyphs. With the 'Language for non-Unicode programs' set to 'English (United Kingdom)', I could get a € (Euro symbol) to show with Lucida Console selected (using the WriteConsole API), but with the raster font selected it was a Ç (capital C with cedilla). U+0100 (capital A with macron) showed up just as A (U+0041, Latin Capital Letter A).

We would only want Unicode passed directly to the output stream if the stream was actually hooked up to a console. If it was being passed to another program through a pipeline, we would want it to pass multibyte strings, for compatibility. (wcin converts from multibyte strings into UTF-16 using the 'current' locale.)

Stability. What an interesting concept. -- Chris Maunder

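To make Mike's WriteConsoleW/WriteFile distinction concrete, a small sketch (the peseta string is just an example; error handling omitted). WriteConsoleW takes the buffer as UTF-16 code units, so no wide-to-multibyte conversion is involved; the same buffer could not be handed to WriteFile as-is:

    #include <windows.h>

    int wmain(int argc, wchar_t* argv[])
    {
        HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
        const wchar_t text[] = L"Pesetas: \x20A7\r\n";
        DWORD written = 0;
        // WriteConsoleW accepts UTF-16 directly; WriteFile on the
        // same handle would expect byte-oriented data.
        WriteConsoleW(hOut, text,
                      (DWORD)(sizeof(text) / sizeof(text[0]) - 1),
                      &written, NULL);
        return 0;
    }

As Mike notes, a TrueType console font is still needed for the right glyph to actually appear.
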
Stephen Hewitt replied (#17):

Joe Woodbury wrote:

The question then is, why isn't the character being converted properly? It's because the string is converted using wctomb_s which, by definition of the standard, uses the "C" locale. One way around this is to make a call to setlocale() before making the call to wcout. But there is no locale for the "console".

As far as outputting Unicode characters goes, the locale is irrelevant. It's relevant for such things as numerical formatting when outputting numbers, or even outputting multi-byte characters, but with a Unicode character the code point itself identifies the character unambiguously. Just because wctomb_s is being used doesn't mean the character is "being converted properly".

Steve

Joe Woodbury replied (#18):

Stephen Hewitt wrote:

As far as outputting Unicode characters goes, the locale is irrelevant.

The standard states that if the destination is composed of single-byte characters then the UNICODE string must be converted. That is the situation here.

Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

Stephen Hewitt replied (#19):

Joe Woodbury wrote:

The standard states that if the destination is composed of single-byte characters then the UNICODE string must be converted. That is the situation here.

Yes, but the encoding used by the console (the destination) is not affected by the locale imbued in the stream in this case. There's also the fact that there is no reason to do a Unicode->MBCS conversion at all; simply use the WriteConsoleW function. The fact is that the Unicode support in the CRT is poor and, following the links I gave in the OP, Microsoft admit this themselves.

Steve

Joe Woodbury replied (#20):

Stephen Hewitt wrote:

simply use the WriteConsoleW function

It seems you can't redirect output done with WriteConsoleW.

Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

Stephen Hewitt replied (#21):

True, but it's easy enough to detect if the output handle (returned from the GetStdHandle API) is a console handle and to behave appropriately.

Steve

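The detection Stephen describes is commonly done with GetConsoleMode, which fails on anything that isn't a console handle; a sketch (the helper name is made up for illustration):

    #include <windows.h>

    // Returns TRUE if standard output is a real console screen buffer
    // rather than a redirected file or pipe; GetConsoleMode only
    // succeeds on console handles.
    BOOL IsStdOutConsole(void)
    {
        DWORD mode = 0;
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        return GetConsoleMode(h, &mode);
    }

A writer can then call WriteConsoleW when this returns TRUE and fall back to WriteFile with a multibyte conversion otherwise.
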
Stephen Hewitt replied (#22):

See here[^] for an example of how Microsoft already use the technique mentioned in my previous post.

Steve
