Microsoft CRT doesn't support Unicode properly for console applications.

  • J Jorgen Sigvardsson

    It's not ASCII, it's UTF-16. 0x100 isn't a valid UTF-16 code.

    -- From the network that brought you "The Simpsons"

    Stephen Hewitt
    wrote on last edited by
    #11

    U+0100 is indeed a valid Unicode character. See here[^] from the Unicode homepage.

    Steve

    • L LittleGreenMartian

      ASCII 256 cannot encode an ASCII character 256 since it is a zero-based array. Character 255 is encoded in various places either as a hard space or a delete character. For reference see here: http://office.microsoft.com/en-us/assistance/HA011331361033.aspx

      Stephen Hewitt
      wrote on last edited by
      #12

      I'm not using ASCII; notice the L prefix before the string. I'm using Unicode strings, in which each character is 16 bits wide (if you ignore surrogate pairs).

      Steve

      • J Joe Woodbury

        The bug is in your code; 0x100 is not a valid console character.

        Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

        Stephen Hewitt
        wrote on last edited by
        #13

        See here[^] for a map of the console characters and their Unicode codes. See here[^] to see where I got this link from. The following is taken from the character chart:

            9E = U+20A7 : PESETA SIGN

        So U+20A7 is a valid console character. But if I try the following program the problem remains:

            #include <iostream>

            int wmain(int argc, wchar_t* argv[])
            {
                using namespace std;

                wcout << L"So far so good!" << endl;
                wcout << L"\x20A7";
                wcout << L"Doesn't appear on console!" << endl;

                return 0;
            }

        It is a bug and it doesn’t matter if the character is a valid console character or not. Microsoft admit the problem and are going to fix it.

        Steve

          Jorgen Sigvardsson
          wrote on last edited by
          #14

          Yes, it is. I didn't see the L prefixes at first. :-O (In fact, \x100 doesn't even make sense in terms of bytes :))

          -- For External Use Only

            Joe Woodbury
            wrote on last edited by
            #15

            (First, after studying this and writing lots of test code, I agree that Microsoft should have created a solution for this. However, I still don't think it's technically their fault.)

            There are two things going on. Technically, wcout is functioning as per the standard. There is an integrity error on output and therefore badbit is set.

            The question then is, why isn't the character being converted properly? It's because the string is converted using wctomb_s which, by definition of the standard, uses the "C" locale. One way around this is to make a call to setlocale() before making the call to wcout. But there is no locale for the "console".

            [EDIT: I read the pertinent part of the standard and it says: "Conversions between the two representations occur within the Standard C Library. The conversion rules can, in principle, be altered by a call to setlocale (page 71) that alters the category LC_CTYPE (page 68). Each wide stream determines the conversion rules at the time it becomes wide oriented, and retains these rules even if the category LC_CTYPE (page 68) subsequently changes." It seems that the console is arguably allowed to determine its conversion rules at the instant of the first call to wcout, which would be to set its locale to the equivalent of "console". However, if standard output is not the console, then the locale it should use isn't defined.]

            Having said that, this is the 21st century and the CRT should understand that the string is being output to the console and operate accordingly. Microsoft should have come up with a documented solution, but I do think they were working within the standard.

            Ultimately, a big chunk of the blame is with the C and C++ standards committees, which have done a horrible job with locale. Not only were they very late to the party, so to speak, their solution has been rather pathetic.

            (Finally, there is arguably a design flaw in iostreams. You should have the option of having an exception thrown on an error or having iostreams "make" it work: output something, anything, but keep on chugging.)

            -- modified at 3:56 Monday 30th October, 2006

            Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke
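
A minimal sketch of the two effects described in post #15, written against standard C++ only. The helper names `convertible_in_current_locale` and `write_checked` are illustrative and do not come from the thread, and the sketch uses the standard `std::wcrtomb` rather than the `wctomb_s` variant the post mentions. It probes whether a wide character can be converted under the currently selected locale, and shows one way to notice and clear a failed stream state so that later output is not silently discarded.

```cpp
#include <climits>   // MB_LEN_MAX
#include <cstddef>   // std::size_t
#include <cwchar>    // std::wcrtomb, std::mbstate_t
#include <ostream>   // std::wostream
#include <string>    // std::wstring

// Hypothetical helper: true if `wc` has a multibyte representation in the
// locale currently selected with std::setlocale (the "C" locale by default).
bool convertible_in_current_locale(wchar_t wc)
{
    char buffer[MB_LEN_MAX];
    std::mbstate_t state = std::mbstate_t();
    return std::wcrtomb(buffer, wc, &state) != static_cast<std::size_t>(-1);
}

// Hypothetical helper: writes `text` and reports whether the stream is still
// usable. On failure the error flags are cleared, so later writes are
// attempted again instead of being ignored.
bool write_checked(std::wostream& os, const std::wstring& text)
{
    os << text;
    if (os.fail()) {
        os.clear();
        return false;
    }
    return true;
}
```

With the program still in the default "C" locale, `convertible_in_current_locale(L'\x20A7')` would be expected to return false, which matches the failed conversion described above. Routing each string of the test program through `write_checked(std::wcout, ...)` should let the final line appear even when the peseta sign itself cannot be written.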

              Mike Dimmick
              wrote on last edited by
              #16

              The console functionality in Windows is a little weird. If you write to the console with the WriteConsole API, you can supply either a Unicode or a multibyte string, depending on whether you call WriteConsoleW or WriteConsoleA. However, if you write to the console with WriteFile, you must use byte-oriented ('ANSI') characters. It appears that WriteConsole has been given the same number and type of parameters as WriteFile to facilitate using a function pointer.

              The CRT ultimately uses WriteFile exclusively to write to streams, so it has to convert to multibyte strings. If the "C" locale is selected, wctomb_s simply fails the conversion with 'invalid sequence' if any character code over 255 is used.

              The actual character displayed in the console depends on the console's selected codepage. See SetConsoleOutputCP. Even if you can get the data passed to the console in the right form, you may still get the wrong output because the console's code page is not Unicode. You have to have a TrueType font selected (e.g. Lucida Console) in order to see the right glyphs. With the 'Language for non-Unicode programs' set to 'English (United Kingdom)', I could get a € (Euro symbol) to show with Lucida Console selected (using the WriteConsole API), but with the raster font selected it was a Ç (capital C with cedilla). U+0100 (capital A with macron) showed up just as A (U+0041, Latin Capital Letter A).

              We would only want Unicode passed directly to the output stream if the stream was actually hooked up to a console. If it was being passed to another program through a pipeline, we would want it to pass multibyte strings, for compatibility. (wcin converts from multibyte strings into UTF-16 using the 'current' locale.)

              Stability. What an interesting concept. -- Chris Maunder

                Stephen Hewitt
                wrote on last edited by
                #17

                Joe Woodbury wrote:

                The question then is, why isn't the character being converted properly? It's because the string is converted using wctomb_s which, by definition of the standard, uses the "C" locale. One way around this is to make a call to setlocale() before making the call to wcout. But there is no locale for the "console".

                As far as outputting Unicode characters goes, the locale is irrelevant. It's relevant for such things as numerical formatting when outputting numbers, or even when outputting multi-byte characters, but with a Unicode character the code point itself identifies the character unambiguously. Just because “wctomb_s” is being used doesn’t mean the character is “being converted properly”.

                Steve

                  Joe Woodbury
                  wrote on last edited by
                  #18

                  Stephen Hewitt wrote:

                  As far as outputting Unicode characters goes, the locale is irrelevant.

                  The standard states that if the destination is composed of single-byte characters then the Unicode string must be converted. That is the situation here.

                  Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

                    Stephen Hewitt
                    wrote on last edited by
                    #19

                    Joe Woodbury wrote:

                    The standard states that if the destination is composed of single-byte characters then the Unicode string must be converted. That is the situation here.

                    Yes, but the encoding used by the console (the destination) is not affected by the locale imbued in the stream in this case. There's also the fact that there is no reason to do a Unicode->MBCS conversion at all; simply use the WriteConsoleW function. The fact is that the Unicode support in the CRT is poor and, following the links I gave in the OP, Microsoft admit this themselves.

                    Steve

                      Joe Woodbury
                      wrote on last edited by
                      #20

                      Stephen Hewitt wrote:

                      simply use the WriteConsoleW function

                      It seems you can't redirect output done with WriteConsoleW.

                      Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

                        Stephen Hewitt
                        wrote on last edited by
                        #21

                        True, but it's easy enough to detect if the output handle (returned from the GetStdHandle API) is a console handle and to behave appropriately.

                        Steve
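
The console check described in posts #19 to #21 might look roughly like the sketch below. It is an illustration under stated assumptions, not the CRT's or Microsoft's actual code: `write_wide` and `to_utf8` are hypothetical names, and encoding redirected output as UTF-8 is a choice made here for portability (the thread suggests a multibyte string in the console code page instead). `GetStdHandle`, `GetConsoleMode`, and `WriteConsoleW` are the real Win32 functions; `GetConsoleMode` fails when the handle is not a console, which the sketch uses as its test.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: encodes a wide string as UTF-8. UTF-16 surrogate pairs
// (16-bit wchar_t, as on Windows) are combined; the input is not otherwise
// validated.
std::string to_utf8(const std::wstring& text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char32_t cp = static_cast<char32_t>(text[i]);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()) {
            const char32_t low = static_cast<char32_t>(text[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: returns true only if the text was handed to a real
// console as UTF-16 through WriteConsoleW.
bool write_wide(const std::wstring& text)
{
#ifdef _WIN32
    const HANDLE handle = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (handle != nullptr && handle != INVALID_HANDLE_VALUE &&
        GetConsoleMode(handle, &mode)) {
        std::fflush(stdout);  // keep ordering with earlier stdio output
        DWORD written = 0;
        return WriteConsoleW(handle, text.data(),
                             static_cast<DWORD>(text.size()),
                             &written, nullptr) != 0;
    }
#endif
    // Redirected output (or a non-Windows build): fall back to bytes.
    const std::string bytes = to_utf8(text);
    std::fwrite(bytes.data(), 1, bytes.size(), stdout);
    return false;
}
```

Calling `write_wide(L"\x20A7")` would then take the `WriteConsoleW` path in an interactive console window and the byte-stream path when output is piped or redirected to a file, which addresses the redirection concern raised in post #20.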

                          Stephen Hewitt
                          wrote on last edited by
                          #22

                          See here[^] for an example of how Microsoft already use the technique mentioned in my previous post.

                          Steve
