Microsoft CRT doesn't support Unicode properly for console applications.

Jorgen Sigvardsson wrote:

Well, it all depends on what comes after the 0x100 byte. If it's a valid UTF-16 "glyph" (not sure if character is the correct word when it comes to Unicode), then obviously there wouldn't be any problems. If it's not a valid character, well, then there's a "syntax error" and it should be dealt with accordingly. For a console driver, that should (in my opinion) mean the same thing as trying to print a non-printable ASCII character. A beep, a question mark, or no action would be appropriate. It shouldn't just roll over and die...

-- [LIVE] From Omicron Persei 8

Joe Woodbury replied (#8):

Further checking shows that Microsoft is implementing proper behavior. If there is a loss of integrity in the output stream, badbit is set and an exception may or may not be thrown (I'm not positive on the last part, but I did run across a reference to this possibility). FYI, 0x100 does represent a valid UTF-16 glyph (an upper-case A with a bar over it) but it isn't supported by all font classes nor by the console. Since it cannot be validly represented, it is a loss of integrity. (This illustrates an issue with iostreams, which consider the source/destination of data to be an abstraction. Consistent behavior for the base iostreams class therefore insists that any illegal character be treated as an error. Personally, I'd rather have it throw an exception.)

Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

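To make the badbit-and-exceptions behaviour Joe describes concrete, here is a minimal sketch (illustrative only, not code from the thread) that opts in to exceptions via the standard exceptions() mask:

    #include <cstdio>
    #include <iostream>

    int wmain(int argc, wchar_t* argv[])
    {
        using namespace std;
        // By default a failed wide-to-narrow conversion just sets
        // badbit and later output is silently discarded; opting in
        // to exceptions makes the failure visible instead.
        wcout.exceptions(wostream::badbit | wostream::failbit);
        try {
            wcout << L"\x100" << endl;
        } catch (const ios_base::failure& e) {
            fprintf(stderr, "wcout failed: %s\n", e.what());
        }
        return 0;
    }
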
Jorgen Sigvardsson replied (#9):

Joe Woodbury wrote:

FYI, 0x100 does represent a valid UTF-16 glyph (an upper-case A with a bar over it) but it isn't supported by all font classes nor by the console. Since it cannot be validly represented, it is a loss of integrity.

It does not, as a single byte. As a 16-bit word, yes. But now that I've had a second look at the code, I see the L-prefix, meaning \x100 is a 16-bit word. :-O

Joe Woodbury wrote:

Personally, I'd rather have it throw an exception.

I'd rather have the option of having it show a user-defined character when it encounters an unprintable character. Kind of like how the Win32 Wide->MultiByte string conversion functions work.

-- Now with chucklelin

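The Win32 behaviour Jorgen refers to is the lpDefaultChar parameter of WideCharToMultiByte; a sketch of that substitution (the helper name ToConsoleBytes is made up for illustration):

    #include <windows.h>
    #include <string>

    // Convert UTF-16 to the console's current output code page,
    // substituting '?' for anything the code page can't represent.
    // (A default character is not permitted for CP_UTF8, but the
    // console's OEM code pages allow it.)
    std::string ToConsoleBytes(const std::wstring& wide)
    {
        UINT cp = GetConsoleOutputCP();
        BOOL usedDefault = FALSE;
        int len = WideCharToMultiByte(cp, 0, wide.c_str(), -1,
                                      NULL, 0, "?", &usedDefault);
        if (len <= 0)
            return std::string();
        std::string out(len, '\0');
        WideCharToMultiByte(cp, 0, wide.c_str(), -1,
                            &out[0], len, "?", &usedDefault);
        out.resize(len - 1); // drop the terminating NUL counted by -1
        return out;
    }
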
David Crow wrote:

Stephen Hewitt wrote:

wcout << L"\x100";

What character is ASCII 256? If it can't be properly rendered, is cout, or the console, just refusing to continue?

"Approved Workmen Are Not Ashamed" - 2 Timothy 2:15

"Judge not by the eye but by the heart." - Native American Proverb

LittleGreenMartian replied (#10):

ASCII cannot encode a character 256, since its codes are zero-based (0-255). Character 255 is encoded in various places as either a hard space or a delete character. For reference see here: http://office.microsoft.com/en-us/assistance/HA011331361033.aspx

Jorgen Sigvardsson wrote:

It's not ASCII, it's UTF-16. 0x100 isn't a valid UTF-16 code.

-- From the network that brought you "The Simpsons"

Stephen Hewitt replied (#11):

U+0100 is indeed a valid Unicode character. See here[^] from the Unicode homepage.

Steve

Stephen Hewitt replied to LittleGreenMartian (#12):

I'm not using ASCII; notice the L prefix before the string. I'm using UNICODE strings, which are 16 bits per character (if you ignore surrogate pairs).

Steve

Joe Woodbury wrote:

The bug is in your code; 0x100 is not a valid console character.

Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

Stephen Hewitt replied (#13):

See here[^] for a map of the console characters and their Unicode codes. See here[^] to see where I got this link from. The following is taken from the character chart:

    9E = U+20A7 : PESETA SIGN

So U+20A7 is a valid console character. But if I try the following program the problem remains:

    #include <iostream>

    int wmain(int argc, wchar_t* argv[])
    {
        using namespace std;
        wcout << L"So far so good!" << endl;
        wcout << L"\x20A7";
        wcout << L"Doesn't appear on console!" << endl;
        return 0;
    }

It is a bug and it doesn't matter if the character is a valid console character or not. Microsoft admit the problem and are going to fix it.

Steve

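Incidentally, the "Doesn't appear on console!" symptom follows from badbit poisoning the stream: every subsequent insertion fails too. A sketch (illustrative only; recovery behaviour may vary between CRT versions) of clearing the state so later output gets through:

    #include <iostream>

    int wmain(int argc, wchar_t* argv[])
    {
        using namespace std;
        wcout << L"So far so good!" << endl;
        wcout << L"\x20A7";        // conversion fails, badbit is set
        if (wcout.fail()) {
            wcout.clear();         // reset the state bits
        }
        // Without the clear() above, this line would be swallowed too.
        wcout << L"Back on the console!" << endl;
        return 0;
    }
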
Jorgen Sigvardsson replied to Stephen Hewitt (#14):

Yes, it is... I didn't see the L-prefixes at first. :-O (In fact, \x100 doesn't even make sense in terms of bytes :))

-- For External Use Only

Joe Woodbury replied (#15):

(First, after studying this and writing lots of test code, I agree that Microsoft should have created a solution for this. However, I still don't think it's technically their fault.)

There are two things going on. Technically, wcout is functioning as per the standard. There is an integrity error on output and therefore badbit is set. The question then is, why isn't the character being converted properly? It's because the string is converted using wctomb_s which, by definition of the standard, uses the "C" locale. One way around this is to make a call to setlocale() before making the call to wcout. But there is no locale for the "console".

[EDIT: I read the pertinent part of the standard and it says: "Conversions between the two representations occur within the Standard C Library. The conversion rules can, in principle, be altered by a call to setlocale (page 71) that alters the category LC_CTYPE (page 68). Each wide stream determines the conversion rules at the time it becomes wide oriented, and retains these rules even if the category LC_CTYPE (page 68) subsequently changes." It seems that the console is arguably allowed to determine its conversion rules at the instant of the first call to wcout, which would be to set its locale to the equivalent of "console". However, if standard output is not the console, then the locale it should use isn't defined.]

Having said that, this is the 21st century and the CRT should understand that the string is being output to the console and operate accordingly. Microsoft should have come up with a documented solution, but I do think they were working within the standard. Ultimately, a big chunk of the blame is with the C and C++ standards committees, which have done a horrible job with locale. Not only were they very late to the party, so to speak, their solution has been rather pathetic.

(Finally, there is arguably a design flaw in iostreams. You should have the option of having an exception thrown on an error or having iostreams "make" it work--output something, anything, but keep on chugging.)

-- modified at 3:56 Monday 30th October, 2006

Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

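A minimal sketch of the setlocale() workaround Joe mentions (illustrative only; whether U+20A7 then renders still depends on the console's code page and font):

    #include <clocale>
    #include <iostream>

    int wmain(int argc, wchar_t* argv[])
    {
        using namespace std;
        // Swap the default "C" locale for the user's locale before
        // wcout becomes wide-oriented, so the wide-to-multibyte
        // conversion uses the user's code page instead of failing
        // outright on anything above 255.
        setlocale(LC_ALL, "");
        wcout << L"\x20A7" << endl;
        return 0;
    }
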
Mike Dimmick replied (#16):

The console functionality in Windows is a little weird. If you write to the console with the WriteConsole API, you can supply either a Unicode or a multibyte string, depending on whether you call WriteConsoleW or WriteConsoleA. However, if you write to the console with WriteFile, you must use byte-oriented ('ANSI') characters. It appears that WriteConsole has been given the same number and type of parameters as WriteFile to facilitate using a function pointer.

The CRT ultimately uses WriteFile exclusively to write to streams, so it has to convert to multibyte strings. If the "C" locale is selected, wctomb_s simply fails the conversion with 'invalid sequence' if any character code over 255 is used.

The actual character displayed in the console depends on the console's selected code page. See SetConsoleOutputCP. Even if you can get the data passed to the console in the right form, you may still get the wrong output because the console's code page is not Unicode. You have to have a TrueType font selected (e.g. Lucida Console) in order to see the right glyphs. With the 'Language for non-Unicode programs' set to 'English (United Kingdom)', I could get a € (Euro symbol) to show with Lucida Console selected (using the WriteConsole API), but with the raster font selected it was a Ç (capital C with cedilla). U+0100 (capital A with macron) showed up just as A (U+0041, Latin Capital Letter A).

We would only want Unicode passed directly to the output stream if the stream was actually hooked up to a console. If it was being passed to another program through a pipeline, we would want it to pass multibyte strings, for compatibility. (wcin converts from multibyte strings into UTF-16 using the 'current' locale.)

Stability. What an interesting concept. -- Chris Maunder

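To make Mike's WriteConsoleW/WriteFile distinction concrete, a small sketch (the peseta string is just an example; error handling omitted). WriteConsoleW takes the buffer as UTF-16 code units, so no wide-to-multibyte conversion is involved; the same buffer could not be handed to WriteFile as-is:

    #include <windows.h>

    int wmain(int argc, wchar_t* argv[])
    {
        HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
        const wchar_t text[] = L"Pesetas: \x20A7\r\n";
        DWORD written = 0;
        // WriteConsoleW accepts UTF-16 directly; WriteFile on the
        // same handle would expect byte-oriented data.
        WriteConsoleW(hOut, text,
                      (DWORD)(sizeof(text) / sizeof(text[0]) - 1),
                      &written, NULL);
        return 0;
    }

As Mike notes, a TrueType console font is still needed for the right glyph to actually appear.
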
Stephen Hewitt replied (#17):

Joe Woodbury wrote:

The question then is, why isn't the character being converted properly? It's because the string is converted using wctomb_s which, by definition of the standard, uses the "C" locale. One way around this is to make a call to setlocale() before making the call to wcout. But there is no locale for the "console".

As far as outputting Unicode characters goes, the locale is irrelevant. It's relevant for such things as numerical formatting when outputting numbers, or even outputting multi-byte characters, but with a Unicode character the code point itself identifies the character unambiguously. Just because wctomb_s is being used doesn't mean the character is "being converted properly".

Steve

Joe Woodbury replied (#18):

Stephen Hewitt wrote:

As far as outputting Unicode characters goes, the locale is irrelevant.

The standard states that if the destination is composed of single-byte characters then the UNICODE string must be converted. That is the situation here.

Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

Stephen Hewitt replied (#19):

Joe Woodbury wrote:

The standard states that if the destination is composed of single-byte characters then the UNICODE string must be converted. That is the situation here.

Yes, but the encoding used by the console (the destination) is not affected by the locale imbued in the stream in this case. There's also the fact that there is no reason to do a Unicode->MBCS conversion at all; simply use the WriteConsoleW function. The fact is that the Unicode support in the CRT is poor and, following the links I gave in the OP, Microsoft admit this themselves.

Steve

Joe Woodbury replied (#20):

Stephen Hewitt wrote:

simply use the WriteConsoleW function

It seems you can't redirect output done with WriteConsoleW.

Anyone who thinks he has a better idea of what's good for people than people do is a swine. - P.J. O'Rourke

Stephen Hewitt replied (#21):

True, but it's easy enough to detect if the output handle (returned from the GetStdHandle API) is a console handle and to behave appropriately.

Steve

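The detection Stephen describes is commonly done with GetConsoleMode, which fails on anything that isn't a console handle; a sketch (the helper name is made up for illustration):

    #include <windows.h>

    // Returns TRUE if standard output is a real console screen buffer
    // rather than a redirected file or pipe; GetConsoleMode only
    // succeeds on console handles.
    BOOL IsStdOutConsole(void)
    {
        DWORD mode = 0;
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        return GetConsoleMode(h, &mode);
    }

A writer can then call WriteConsoleW when this returns TRUE and fall back to WriteFile with a multibyte conversion otherwise.
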
Stephen Hewitt replied (#22):

See here[^] for an example of how Microsoft already use the technique mentioned in my previous post.

Steve
