unicode strings

    Waldermort wrote (#1):

    And here we go with another problem. I am declaring a constant string in my code and wrapping it with the _T() macro, which in a Unicode build translates to the L prefix. The problem is that my string is actually a multibyte pathname containing Chinese characters. I am trying to use this string within a function but am having problems: in an MBCS build the string length is 60, but in a UNICODE build it is 64. Any ideas?

      fefe wyx wrote (#2, in reply to #1):

      In MBCS strings each Chinese character takes two chars, but in wide char strings, each Chinese character only takes one wide char. That will cause the difference in string length.
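A minimal sketch of the length difference described above, assuming code page 936 (GBK) for the multibyte form. The bytes are written as hex escapes so the result does not depend on how the source file is saved; `kMbcs`, `kWide` and the function names are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// U+4F60 U+597D ("ni hao") encoded in code page 936: two bytes per character.
const char kMbcs[] = "\xC4\xE3\xBA\xC3";

// The same two characters as a wide string: one wchar_t per character.
const wchar_t kWide[] = L"\u4F60\u597D";

// strlen counts chars (bytes), not characters.
std::size_t MbcsLength() { return std::strlen(kMbcs); }

// wcslen counts wchar_t units.
std::size_t WideLength() { return std::wcslen(kWide); }

void CheckLengths() {
    assert(MbcsLength() == 4);  // 2 characters * 2 bytes
    assert(WideLength() == 2);  // 2 characters * 1 wchar_t
}
```

Here the multibyte form is the longer one. A wide string that comes out longer than its MBCS counterpart, as reported in post #1, suggests that the literal was not decoded with the code page it was written in.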

        Waldermort wrote (#3, in reply to #2):

        In MBCS a Chinese character takes 2 bytes, and in UNICODE it also takes 2 bytes. My question is: why does the 'L' macro not map the string correctly? Instead of leaving the Chinese characters as they are, it breaks them up further, giving each character 4 bytes. How can I overcome this without passing the string through MultiByteToWideChar()? I'm sure this does not apply only to Chinese characters; it must be the same for any Unicode character entered into a literal string.

          Pierre Leclercq wrote (#4, in reply to #3):

          I think you cannot escape the call to MultiByteToWideChar. Actually, you need two calls: one to get the required buffer size and one to fill the buffer.
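The two-call pattern could be sketched roughly as follows. `ToWide` is a hypothetical helper, not something from the thread. On Windows it calls `MultiByteToWideChar` twice; the non-Windows branch is only a stand-in that applies the same "measure, then fill" idea with `std::mbstowcs` and the current C locale, so the sketch can be compiled elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
#else
#include <cstdlib>
#endif

// Hypothetical helper: converts a multibyte string to a wide string.
// On Windows 'codePage' is passed to MultiByteToWideChar (e.g. 936 for GBK);
// elsewhere it is ignored and the current C locale decides the encoding.
std::wstring ToWide(const std::string& mbcs, unsigned codePage) {
    if (mbcs.empty()) return std::wstring();
#ifdef _WIN32
    // First call: no output buffer, returns the required size in wchar_t units.
    int needed = MultiByteToWideChar(codePage, 0, mbcs.data(),
                                     static_cast<int>(mbcs.size()), NULL, 0);
    if (needed <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(needed), L'\0');
    // Second call: fill the buffer that was just sized.
    int written = MultiByteToWideChar(codePage, 0, mbcs.data(),
                                      static_cast<int>(mbcs.size()),
                                      &wide[0], needed);
    assert(written == needed);
    (void)written;
    return wide;
#else
    (void)codePage;
    // First call: null destination, returns the length without the terminator.
    std::size_t needed = std::mbstowcs(NULL, mbcs.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1)) return std::wstring();
    std::wstring wide(needed, L'\0');
    // Second call: fill the buffer that was just sized.
    std::size_t written = std::mbstowcs(&wide[0], mbcs.c_str(), needed);
    assert(written == needed);
    (void)written;
    return wide;
#endif
}
```

Sizing a `std::wstring` from the first call avoids guessing a worst-case buffer and leaves the terminator to the string class.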

            Waldermort wrote (#5, in reply to #4):

            Yeah, I figured. I decided to use the ATL macros. But look at this snippet, I can't work it out.

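            USES_CONVERSION;
            // Uncompressed text: length is a count of UTF-16 code units at this point.
            length *= 2; // now a byte count, which also bounds the CP936 output (at most 2 bytes per code unit)
            char *text = new char[length + 1];
            memset(text, 0, length + 1);
            // Convert the UTF-16 data to code page 936 (GBK, Simplified Chinese).
            WideCharToMultiByte(936, 0, (LPCWSTR)data, length / 2, text, length + 1, NULL, NULL);
            // Note: A2T converts with the system ANSI code page, which may not be 936.
            strings[_added_strings] = new String(A2T(text));
            _added_strings++;
            delete[] text;
            data += length;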

            Basically, I am reading from a byte array that contains UTF-16 text. Here I am converting it to multibyte, but in a Unicode build I want to leave it as it is. The problem is that if I make text a wchar_t buffer and memcpy the bytes over, I get no text. The only way I can get it to work is to convert to multibyte first and then convert back to wide.
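One way to keep the UTF-16 data as it is, sketched under the assumptions that the bytes are little-endian and that the count is given in code units. `ReadUtf16LE` is a hypothetical helper and is unrelated to the poster's `String` class. It uses `std::u16string` rather than `wchar_t` so that it behaves the same on platforms where `wchar_t` is not 16 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: builds a UTF-16 string from 'units' little-endian
// code units stored at 'data' (2 * units bytes are read).
std::u16string ReadUtf16LE(const unsigned char* data, std::size_t units) {
    std::u16string text;
    text.reserve(units);
    for (std::size_t i = 0; i < units; ++i) {
        // Low byte first, then high byte.
        char16_t unit =
            static_cast<char16_t>(data[2 * i] | (data[2 * i + 1] << 8));
        text.push_back(unit);
    }
    return text;
}

void CheckReadUtf16LE() {
    // U+4F60 U+597D stored little-endian, followed by an unrelated byte.
    const unsigned char bytes[] = {0x60, 0x4F, 0x7D, 0x59, 0xFF};
    std::u16string text = ReadUtf16LE(bytes, 2);
    assert(text.size() == 2);
    assert(text[0] == 0x4F60 && text[1] == 0x597D);
}
```

Assembling each code unit from two bytes sidesteps alignment and byte-order questions, and the string object tracks its own length. With a raw memcpy into a buffer, a missing terminator or a mix-up between byte counts and character counts are plausible causes of an empty result, although the thread does not confirm which one applied here.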

              Pierre Leclercq wrote (#6, in reply to #5):

              IMO, the difference in length should be related to the string representation. I would guess the unicode string starts with the length of the string. So with a simple memcpy you might incorrectly copy the string.

                Justin Tay wrote (#7, in reply to #5):

                L"" is not a macro; it is simply the way to mark a unicode string literal, i.e. whatever is inside it has to be a valid unicode string. If you specify an MBCS string literal, then in your unicode build you have to convert it from MBCS (specifying the codepage the string is in) to unicode.

                You should also consider what happens in the MBCS build if the user is not running with the codepage that the MBCS string is in (MBCS strings are all codepage specific). Converting your MBCS string to unicode and then back to the user's current codepage would probably fail, because the characters cannot be mapped, and using the MBCS string while ignoring the user's current codepage is very wrong and might produce invalid file path characters on that user's codepage.

                If you are using VC7 or later, you should use the new ATL conversion classes instead, though there are some differences.
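The relationship between the `L` prefix and a `_T()`-style macro can be illustrated with a simplified stand-in. `MY_UNICODE`, `MY_T` and `MyChar` are made-up names for this sketch and are not the real `<tchar.h>` definitions.

```cpp
#include <cassert>
#include <type_traits>

// Stand-in for the build setting that the real headers check.
#define MY_UNICODE 1

#if MY_UNICODE
#define MY_T(x) L##x  // pastes the L prefix onto the literal token
typedef wchar_t MyChar;
#else
#define MY_T(x) x     // leaves the literal as a narrow string
typedef char MyChar;
#endif

// The L prefix is part of the literal itself: the compiler produces
// wchar_t elements at compile time, nothing is converted at run time.
const MyChar kPath[] = MY_T("C:\\data");

static_assert(std::is_same<MyChar, wchar_t>::value,
              "MY_UNICODE selects wchar_t");
static_assert(sizeof(kPath) == 8 * sizeof(wchar_t),
              "7 characters plus the terminator");

void CheckPath() {
    assert(kPath[0] == L'C');
    assert(kPath[7] == L'\0');
}
```

Only the macro is a preprocessor construct; the prefix it adds changes the element type of the literal, which is why the characters inside must already be meaningful to the compiler as Unicode text.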

                  Mike Dimmick wrote (#8, in reply to #1):

                  The behaviour will depend on the codepage that the compiler uses to read the source file, which is the user's default codepage; for a Western developer, that will normally be Windows-1252. I think newer versions of the compilers can recognise the Byte Order Mark to handle UTF-16 or UTF-8 source code files (the BOM is the character U+FEFF, so if the file begins with the bytes 0xFF 0xFE, it's likely to be a UTF-16 little-endian file, while if it starts with 0xEF 0xBB 0xBF, it's probably a UTF-8 file). In VS2003 and 2005, you can go to File, Advanced Save Options and specify the encoding to use for a given source file.

                  You may get the best results using either a UTF-8 or UTF-16 source file, if your environment supports it, or using hex escapes to specify the characters explicitly. I'm not sure if Visual C++ supports the '\u' syntax for specifying Unicode characters directly; I don't think it does.

                  Note that any Latin characters in your MBCS string will be encoded using one byte, not two; the same goes for any character coloured white on a codepage reference chart, for example the one for Simplified Chinese CP936.

                  Finally, you're not hard-coding the translation of 'Program Files', are you? There are many locations for which you can, and should, discover the localized path using the SHGetSpecialFolderPath API.

                  Stability. What an interesting concept. -- Chris Maunder
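The BOM signatures mentioned in post #8 can be checked with a few byte comparisons. This is a rough sketch: `DetectBom` and `SourceEncoding` are hypothetical names, and the UTF-16 big-endian case (0xFE 0xFF) is an addition that the post does not mention.

```cpp
#include <cassert>
#include <cstddef>

enum class SourceEncoding { Unknown, Utf8, Utf16LE, Utf16BE };

// Guesses an encoding from the first bytes of a file. Files without a BOM
// yield Unknown. UTF-32 is not considered, so a UTF-32 little-endian BOM
// (which also starts with 0xFF 0xFE) would be reported as UTF-16LE.
SourceEncoding DetectBom(const unsigned char* bytes, std::size_t size) {
    if (size >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        return SourceEncoding::Utf8;
    if (size >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        return SourceEncoding::Utf16LE;
    if (size >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        return SourceEncoding::Utf16BE;
    return SourceEncoding::Unknown;
}

void CheckDetectBom() {
    const unsigned char utf8[] = {0xEF, 0xBB, 0xBF, 'a'};
    const unsigned char plain[] = {'a', 'b', 'c'};
    assert(DetectBom(utf8, sizeof(utf8)) == SourceEncoding::Utf8);
    assert(DetectBom(plain, sizeof(plain)) == SourceEncoding::Unknown);
}
```

A file with no BOM tells the function nothing, which matches the post's point that the compiler then falls back to a default codepage.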

                    Waldermort wrote (#9, in reply to #8):

                    The file I am reading is a standard BIFF8 Excel file, and going by the docs for the file format I am reading it the correct way. As it happens, I don't need to worry too much about dealing with multibyte characters; I live and work in China, so I am surrounded by them on a daily basis. My problem is with Unicode: UTF-7, UTF-8, UTF-16, little endian, big endian, where does it end? Usually all my builds are MBCS to be compatible with Win95 systems (don't ask), but lately I have begun to realise the importance of Unicode. So it looks like I am going to have to start learning all over again.

                      Michael Dunn wrote (#10, in reply to #1):

                      #ifdef UNICODE
                      LPCWSTR str = L"\x4f60\x597d"; // ni hao (U+4F60 U+597D)
                      #else
                      LPCSTR str = "\xc4\xe3\xba\xc3"; // the same text as CP936 (GBK) bytes
                      #endif

                      You can't do it with one literal because the contents of the literal have to match the character set that you're compiling for.

                      --Mike-- Visual C++ MVP :cool: LINKS~! Ericahist | PimpFish | CP SearchBar v3.0 | C++ Forum FAQ

                        Waldermort wrote (#11, in reply to #10):

                        So it's probably better to store all string literals in a string table, so that LoadString() can handle the mess. Having said that, the string table is converted to unicode, so won't the same thing happen?

                          Michael Dunn wrote (#12, in reply to #11):

                          No, because when you call LoadStringA(), the OS will convert the string to MBCS for you. (All ANSI APIs on NT work this way.)
