Microsoft CRT doesn't support Unicode properly for console applications.

22 Posts 6 Posters 2 Views 1 Watching
  • Joe Woodbury wrote:

    The bug is in your code; 0x100 is not a valid console character.

    Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

    Stephen Hewitt wrote (post #13):

    See here[^] for a map of the console characters and their Unicode codes. See here[^] to see where I got this link from. The following is taken from the character chart:

        9E = U+20A7 : PESETA SIGN

    So U+20A7 is a valid console character. But if I try the following program the problem remains:

        #include <iostream>

        int wmain(int argc, wchar_t* argv[])
        {
            using namespace std;
            wcout << L"So far so good!" << endl;
            wcout << L"\x20A7";
            wcout << L"Doesn't appear on console!" << endl;
            return 0;
        }

    It is a bug and it doesn’t matter if the character is a valid console character or not. Microsoft admit the problem and are going to fix it.

    Steve

    • Stephen Hewitt wrote:

      U+0100 is indeed a valid Unicode character. See here[^] from the Unicode homepage.

      Steve

      Jorgen Sigvardsson wrote (post #14):

      Yes, it is. I didn't see the L prefixes at first. :-O (In fact, \x100 doesn't even make sense in terms of bytes. :))

      -- For External Use Only


        Joe Woodbury wrote (post #15):

        (First, after studying this and writing lots of test code, I agree that Microsoft should have created a solution for this. However, I still don't think it's technically their fault.)

        There are two things going on. Technically, wcout is functioning as per the standard: there is an integrity error on output, and therefore badbit is set. The question then is, why isn't the character being converted properly? It's because the string is converted using wctomb_s which, by definition of the standard, uses the "C" locale. One way around this is to make a call to setlocale() before making the call to wcout. But there is no locale for the "console".

        [EDIT: I read the pertinent part of the standard and it says: "Conversions between the two representations occur within the Standard C Library. The conversion rules can, in principle, be altered by a call to setlocale (page 71) that alters the category LC_CTYPE (page 68). Each wide stream determines the conversion rules at the time it becomes wide oriented, and retains these rules even if the category LC_CTYPE (page 68) subsequently changes." It seems that the console is arguably allowed to determine its conversion rules at the first call to wcout, which would be to set its locale to the equivalent of "console". However, if standard output is not the console, then the locale it should use isn't defined.]

        Having said that, this is the 21st century and the CRT should understand that the string is being output to the console and operate accordingly. Microsoft should have come up with a documented solution, but I do think they were working within the standard.

        Ultimately, a big chunk of the blame lies with the C and C++ standards committees, which have done a horrible job with locale. Not only were they very late to the party, so to speak, their solution has been rather pathetic.

        (Finally, there is arguably a design flaw in iostreams. You should have the option of having an exception thrown on an error or having iostreams "make" it work: output something, anything, but keep on chugging.)

        -- modified at 3:56 Monday 30th October, 2006
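
On the wish for an exception on error: standard iostreams do offer an opt-in through the exceptions() member of the stream. The sketch below only illustrates that mechanism and is not how the CRT behaves. NarrowOnlyBuf and write_throws are hypothetical names, and the buffer merely imitates a destination that rejects any wide character above 0xFF.

```cpp
#include <ios>
#include <ostream>
#include <streambuf>

// Hypothetical stand-in for a narrow-only destination: it refuses any
// wide character above 0xFF, imitating the failed conversion.
class NarrowOnlyBuf : public std::wstreambuf {
protected:
    int_type overflow(int_type ch) override {
        if (traits_type::eq_int_type(ch, traits_type::eof()))
            return traits_type::not_eof(ch);  // nothing to write
        if (ch > 0xFF)
            return traits_type::eof();        // report failure to the stream
        return ch;                            // pretend the character was written
    }
};

// Returns true if inserting `text` raised std::ios_base::failure.
bool write_throws(const wchar_t* text) {
    NarrowOnlyBuf buf;
    std::wostream out(&buf);
    out.exceptions(std::ios_base::badbit);    // opt in: throw when badbit is set
    try {
        out << text;
        return false;
    } catch (const std::ios_base::failure&) {
        return true;
    }
}
```

With these definitions, write_throws(L"\x20A7") reports an exception, while write_throws(L"So far so good!") does not. Without the call to exceptions(), the stream would silently set badbit and ignore later output until clear() is called, which matches the behaviour of the test program in this thread.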

        Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke


          Mike Dimmick wrote (post #16):

          The console functionality in Windows is a little weird. If you write to the console with the WriteConsole API, you can supply either a Unicode or a multibyte string, depending on whether you call WriteConsoleW or WriteConsoleA. However, if you write to the console with WriteFile, you must use byte-oriented ('ANSI') characters. It appears that WriteConsole has been given the same number and type of parameters as WriteFile to facilitate using a function pointer.

          The CRT ultimately uses WriteFile exclusively to write to streams, so it has to convert to multibyte strings. If the "C" locale is selected, wctomb_s simply fails the conversion with 'invalid sequence' if any character code over 255 is used.

          The actual character displayed in the console depends on the console's selected codepage. See SetConsoleOutputCP. Even if you can get the data passed to the console in the right form, you may still get the wrong output because the console's code page is not Unicode. You have to have a TrueType font selected (e.g. Lucida Console) in order to see the right glyphs. With the 'Language for non-Unicode programs' set to 'English (United Kingdom)', I could get a € (Euro symbol) to show with Lucida Console selected (using the WriteConsole API), but with the raster font selected it was a Ç (capital C with cedilla). U+0100 (capital A with macron) showed up just as A (U+0041, Latin Capital Letter A).

          We would only want Unicode passed directly to the output stream if the stream was actually hooked up to a console. If it was being passed to another program through a pipeline, we would want it to pass multibyte strings, for compatibility. (wcin converts from multibyte strings into UTF-16 using the 'current' locale.)
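
The conversion failure described above can be reproduced without a console. This is a minimal sketch that uses the standard std::wcrtomb rather than Microsoft's wctomb_s, so it only approximates what the CRT does internally. converts_in_c_locale is a hypothetical helper name, and the exact cutoff is implementation-specific (some C libraries already reject codes above 127).

```cpp
#include <climits>
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <string>

// Returns true if `wc` can be converted to a multibyte sequence while
// LC_CTYPE is set to the "C" locale. The previous setting is restored.
bool converts_in_c_locale(wchar_t wc) {
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";
    std::setlocale(LC_CTYPE, "C");

    std::mbstate_t state = std::mbstate_t();
    char buf[MB_LEN_MAX];
    // wcrtomb returns (size_t)-1 when the character has no representation.
    const bool ok =
        std::wcrtomb(buf, wc, &state) != static_cast<std::size_t>(-1);

    std::setlocale(LC_CTYPE, saved.c_str());
    return ok;
}
```

Here converts_in_c_locale(L'A') succeeds, while converts_in_c_locale(L'\x20A7') fails, which is the same kind of failure that leaves wcout with badbit set in the test program.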

          Stability. What an interesting concept. -- Chris Maunder


            Stephen Hewitt wrote (post #17):

            Joe Woodbury wrote:

            The question then is, why isn't the character being converted properly? It's because the string is converted using wctomb_s which, by definition of the standard, uses the "C" locale. One way around this is to make a call to setlocale() before making the call to wcout. But there is no locale for the "console".

            As far as outputting Unicode characters goes, the locale is irrelevant. It's relevant for such things as numerical formatting when outputting numbers, or even for outputting multi-byte characters, but with a Unicode character the code point itself identifies the character unambiguously. Just because “wctomb_s” is being used doesn’t mean the character is “being converted properly”.

            Steve


              Joe Woodbury wrote (post #18):

              Stephen Hewitt wrote:

              As far as outputting Unicode characters goes, the locale is irrelevant.

              The standard states that if the destination is composed of single-byte characters then the Unicode string must be converted. That is the situation here.

              Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke


                Stephen Hewitt wrote (post #19):

                Joe Woodbury wrote:

                The standard states that if the destination is composed of single-byte characters then the Unicode string must be converted. That is the situation here.

                Yes, but the encoding used by the console (the destination) is not affected by the locale imbued in the stream in this case. There's also the fact that there is no reason to do a Unicode->MBCS conversion at all; simply use the WriteConsoleW function. The fact is that the Unicode support in the CRT is poor and, following the links I gave in the OP, Microsoft admit this themselves.

                Steve


                  Joe Woodbury wrote (post #20):

                  Stephen Hewitt wrote:

                  simply use the WriteConsoleW function

                  It seems you can't redirect output done with WriteConsoleW.

                  Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke


                    Stephen Hewitt wrote (post #21):

                    True, but it's easy enough to detect if the output handle (returned from the GetStdHandle API) is a console handle and to behave appropriately.
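
A minimal sketch of that detection, under stated assumptions: GetConsoleMode succeeds only for console handles, so it can serve as the test. to_utf8 and write_wide are hypothetical names, the choice of UTF-8 for redirected output is an assumption of this sketch (the thread does not settle which encoding a pipe should receive), and the non-Windows branch exists only so the code compiles elsewhere.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Encode a wide string as UTF-8. UTF-16 surrogate pairs are combined;
// lone surrogates are not validated in this sketch.
std::string to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const unsigned long lo = static_cast<unsigned long>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Write to standard output: WriteConsoleW when it is a real console,
// UTF-8 bytes through WriteFile when it has been redirected.
bool write_wide(const std::wstring& text) {
#ifdef _WIN32
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0, written = 0;
    if (GetConsoleMode(h, &mode)) {  // succeeds only for console handles
        return WriteConsoleW(h, text.data(), static_cast<DWORD>(text.size()),
                             &written, nullptr) != 0;
    }
    const std::string bytes = to_utf8(text);
    return WriteFile(h, bytes.data(), static_cast<DWORD>(bytes.size()),
                     &written, nullptr) != 0;
#else
    const std::string bytes = to_utf8(text);
    return std::fwrite(bytes.data(), 1, bytes.size(), stdout) == bytes.size();
#endif
}
```

Calling write_wide(L"\x20A7") in place of the wcout line in the test program would take the WriteConsoleW path on a console and the byte-oriented path when output is piped or redirected to a file.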

                    Steve


                      Stephen Hewitt wrote (post #22):

                      See here[^] for an example of how Microsoft already use the technique mentioned in my previous post.

                      Steve
