Microsoft CRT doesn't support Unicode properly for console applications.
-
It's not ASCII, it's UTF-16. 0x100 isn't a valid UTF-16 code.
-- From the network that brought you "The Simpsons"
-
ASCII 256 cannot encode an ASCII character 256 since it is a zero-based array. Character 255 is encoded in various places either as a hard space or a delete character. For reference see here: http://office.microsoft.com/en-us/assistance/HA011331361033.aspx
I'm not using ASCII; notice the L prefix before the string. I'm using UNICODE strings which are 16 bits (if you ignore surrogate pairs).
Steve
-
The bug is in your code; 0x100 is not a valid console character.
Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke
See here[^] for a map of the console characters and their Unicode codes. See here[^] to see where I got this link from. The following is taken from the character chart:

    9E = U+20A7 : PESETA SIGN

So U+20A7 is a valid console character. But if I try the following program the problem remains:

    #include <iostream>

    int wmain(int argc, wchar_t* argv[])
    {
        using namespace std;
        wcout << L"So far so good!" << endl;
        wcout << L"\x20A7";
        wcout << L"Doesn't appear on console!" << endl;
        return 0;
    }

It is a bug, and it doesn't matter whether the character is a valid console character or not. Microsoft admit the problem and are going to fix it.
Steve
-
Yes, it is.. I didn't see the L-prefixes first. :-O (In fact, \x100 doesn't even make sense in terms of bytes :))
-- For External Use Only
-
(First, after studying this and writing lots of test code, I agree that Microsoft should have created a solution for this. However, I still don't think it's technically their fault.)

There are two things going on. Technically, the wcout is functioning as per the standard. There is an integrity error on output and therefore badbit is set. The question then is, why isn't the character being converted properly? It's because the string is converted using wctomb_s which, by definition of the standard, uses the "C" locale. One way around this is to make a call to setlocale() before making the call to wcout. But there is no locale for the "console".

[EDIT: I read the pertinent part of the standard and it says: "Conversions between the two representations occur within the Standard C Library. The conversion rules can, in principle, be altered by a call to setlocale (page 71) that alters the category LC_CTYPE (page 68). Each wide stream determines the conversion rules at the time it becomes wide oriented, and retains these rules even if the category LC_CTYPE (page 68) subsequently changes." It seems that the console is arguably allowed to determine its conversion rules at the time of the first call to wcout, which would be to set its locale to the equivalent of "console." However, if standard output is not the console, then the locale it should use isn't defined.]

Having said that, this is the 21st century and the CRT should understand that the string is being output to the console and operate accordingly. Microsoft should have come up with a documented solution, but I do think they were working within the standard. Ultimately, a big chunk of the blame is with the C and C++ standards committees, which have done a horrible job with locale. Not only were they very late to the party, so to speak, their solution has been rather pathetic.

(Finally, I think there is arguably a design flaw in iostreams. You should have the option of having an exception thrown on an error or having iostreams "make" it work: output something, anything, but keep on chugging.) -- modified at 3:56 Monday 30th October, 2006
Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke
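The badbit behaviour described above is why the last line of the test program never shows up: once the stream is in an error state, every later insertion is ignored until the flags are cleared. Here is a minimal sketch of checking for that and recovering. The helper name write_and_recover is made up for illustration, and whether a given character actually triggers the failure depends on the CRT and the locale in effect.

```cpp
#include <iostream>
#include <string>

// Hypothetical helper (not part of any library): writes text to std::wcout
// and reports whether the stream stayed usable. If the write put the stream
// into an error state, the flags are cleared so that later output is not
// silently dropped. The text that failed to convert is still lost.
bool write_and_recover(const std::wstring& text)
{
    std::wcout << text;
    std::wcout.flush();
    if (std::wcout.good())
        return true;
    std::wcout.clear();   // reset badbit/failbit
    return false;
}
```

Calling write_and_recover(L"\x20A7") and then writing the next line lets the rest of the output through, whatever happened to the peseta sign itself.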
-
The console functionality in Windows is a little weird. If you write to the console with the WriteConsole API, you can supply either a Unicode or a multibyte string, depending on whether you call WriteConsoleW or WriteConsoleA. However, if you write to the console with WriteFile, you must use byte-oriented ('ANSI') characters. It appears that WriteConsole has been given the same number and type of parameters as WriteFile to facilitate using a function pointer.

The CRT ultimately uses WriteFile exclusively to write to streams, so it has to convert to multibyte strings. If the "C" locale is selected, wctomb_s simply fails the conversion with 'invalid sequence' if any character code over 255 is used.

The actual character displayed in the console depends on the console's selected codepage. See SetConsoleOutputCP. Even if you can get the data passed to the console in the right form, you may still get the wrong output because the console's code page is not Unicode. You have to have a TrueType font selected (e.g. Lucida Console) in order to see the right glyphs. With the 'Language for non-Unicode programs' set to 'English (United Kingdom)', I could get a € (Euro symbol) to show with Lucida Console selected (using the WriteConsole API), but with the raster font selected it was a Ç (capital C with cedilla). U+0100 (capital A with macron) showed up just as A (U+0041, Latin Capital Letter A).

We would only want Unicode passed directly to the output stream if the stream was actually hooked up to a console. If it was being passed to another program through a pipeline, we would want it to pass multibyte strings, for compatibility. (wcin converts from multibyte strings into UTF-16 using the 'current' locale.)

Stability. What an interesting concept. -- Chris Maunder
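The conversion failure under the "C" locale can be checked in isolation, without involving the console at all. This is a small sketch that uses the standard std::wctomb as a portable stand-in for wctomb_s (an assumption on my part; recent MSVC versions may warn that wctomb is deprecated). The helper name converts_in_current_locale is hypothetical.

```cpp
#include <climits>
#include <clocale>
#include <cstdlib>

// Hypothetical helper: returns true if wc has a multibyte encoding under
// whatever LC_CTYPE locale is current when the function is called.
bool converts_in_current_locale(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    std::wctomb(nullptr, 0);              // reset any conversion state
    return std::wctomb(buf, wc) != -1;    // -1 means no valid encoding
}
```

With std::setlocale(LC_CTYPE, "C") in effect, L'A' converts but L'\x20A7' does not, which matches the 'invalid sequence' failure described above. Which locale name, if any, makes the second call succeed depends on the platform.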
-
Joe Woodbury wrote:
The question then is, why isn't the character being converted properly? It's because the string is converted using wctomb_s which, by definition of the standard, uses the "C" locale. One way around this is to make a call to setlocale() before making the call to wcout. But there is no locale for the "console".
As far as outputting Unicode characters goes, the locale is irrelevant. It's relevant for such things as numerical formatting when outputting numbers, or even for outputting multi-byte characters, but with a Unicode character the code point itself identifies the character unambiguously. Just because "wctomb_s" is being used doesn't mean the character is "being converted properly".
Steve
-
Stephen Hewitt wrote:
As far as outputting Unicode characters go, the locale is irrelevant.
The standard states that if the destination is composed of single-byte characters then the UNICODE string must be converted. That is the situation here.
Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke
-
Joe Woodbury wrote:
The standard states that if the destination is composed of single-byte characters then the UNICODE string must be converted. That is the situation here.
Yes, but the encoding used by the console (the destination) is not affected by the locale imbued in the stream in this case. There's also the fact that there is no reason to do a Unicode->MBCS conversion at all; simply use the WriteConsoleW function. The fact is that the Unicode support in the CRT is poor and, following the links I gave in the OP, Microsoft admit this themselves.
Steve
-
Stephen Hewitt wrote:
simply use the WriteConsoleW function
It seems you can't redirect output done with WriteConsoleW.
Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke
-
True, but it's easy enough to detect if the output handle (returned from the GetStdHandle API) is a console handle and to behave appropriately.
Steve
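A rough sketch of that detection follows. It assumes GetConsoleMode succeeding is a good enough test for "this handle is a console", and it assumes UTF-8 is an acceptable byte encoding when output is redirected (the thread only says "multibyte", so that choice is mine). The names to_utf8 and write_wide are hypothetical, and the non-Windows branch exists only so the sketch builds elsewhere.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: encodes a wide string as UTF-8. Handles both 16-bit
// wchar_t (UTF-16 with surrogate pairs) and 32-bit wchar_t. An unpaired
// surrogate is passed through as a 3-byte sequence, not rejected.
std::string to_utf8(const std::wstring& ws)
{
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(ws[i]);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size()) {
            unsigned long lo = static_cast<unsigned long>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {   // combine the pair
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: sends UTF-16 straight to a real console, and bytes
// to anything else (file, pipe), so redirection keeps working.
void write_wide(const std::wstring& text)
{
#ifdef _WIN32
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0, written = 0;
    if (GetConsoleMode(h, &mode)) {       // succeeds only for console handles
        WriteConsoleW(h, text.data(), static_cast<DWORD>(text.size()),
                      &written, nullptr);
        return;
    }
    const std::string bytes = to_utf8(text);
    WriteFile(h, bytes.data(), static_cast<DWORD>(bytes.size()),
              &written, nullptr);
#else
    const std::string bytes = to_utf8(text);
    std::fwrite(bytes.data(), 1, bytes.size(), stdout);
#endif
}
```

Because this bypasses the CRT's buffering, flush wcout/cout before calling write_wide if you mix the two, otherwise the output can appear out of order.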