Unicode strings

Waldermort

And here we go with another problem. I am declaring a constant string in my code and wrapping it with the _T() macro, which in a Unicode build translates to the L prefix. The problem is that my string is actually a multibyte pathname containing Chinese characters. When I try to use this string in a function I run into trouble: in an MBCS build the string length is 60, but in a UNICODE build it is 64. Any ideas?

fefe wyx

In MBCS strings each Chinese character takes two chars, but in wide char strings, each Chinese character only takes one wide char. That will cause the difference in string length.
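The difference described here can be seen by counting code units for the same two characters in both representations. This is only a minimal sketch: it assumes the narrow string holds code page 936 (GBK) bytes, written as hex escapes so the result does not depend on how the source file is saved. The names `kNarrow`, `kWide`, `narrow_length` and `wide_length` are made up for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// The two characters U+4F60 U+597D encoded in code page 936 (GBK):
// each Chinese character occupies two bytes (two chars).
const char kNarrow[] = "\xC4\xE3\xBA\xC3";

// The same two characters as a wide string: one wchar_t per character.
const wchar_t kWide[] = L"\x4F60\x597D";

// Length in chars (bytes) of the multibyte version.
std::size_t narrow_length() { return std::strlen(kNarrow); }

// Length in wide characters of the wide version.
std::size_t wide_length() { return std::wcslen(kWide); }
```

Here `narrow_length()` gives 4 and `wide_length()` gives 2, so a correctly converted wide string should come out shorter than its MBCS form, never longer.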

Waldermort

In MBCS a Chinese character takes 2 bytes, and in UNICODE it also takes 2 bytes. My question is, why does the 'L' macro not map the string correctly? Instead of leaving the Chinese characters as they are, it breaks them up further, giving each character 4 bytes. How can I overcome this without passing the string through MultiByteToWideChar()? I'm sure this does not only apply to Chinese characters; it must be the same with any Unicode character entered into a literal string.

Pierre Leclercq

I think you cannot escape the call to MultiByteToWideChar. Actually even two, one for getting the buffer size, and one for filling the buffer.
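The size-then-fill pattern mentioned here can be sketched as follows. MultiByteToWideChar is Windows-only, so this sketch swaps in the standard `std::mbsrtowcs`, which supports the same two-step shape: a first call with a null destination returns the required length, and a second call fills the buffer. (MultiByteToWideChar behaves analogously when the output buffer size is passed as 0.) The helper name `widen` is hypothetical, and the conversion uses the current C locale rather than an explicit code page.

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a multibyte string to a wide string
// using the current C locale, in two calls.
std::wstring widen(const std::string& narrow) {
    std::mbstate_t state = std::mbstate_t();
    const char* src = narrow.c_str();

    // Call 1: null destination, so only the required number of wide
    // characters (excluding the terminator) is returned.
    const std::size_t needed = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (needed == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence");
    }
    if (needed == 0) {
        return std::wstring();
    }

    // Call 2: fill a buffer of exactly that size.
    std::wstring wide(needed, L'\0');
    src = narrow.c_str();
    state = std::mbstate_t();
    std::mbsrtowcs(&wide[0], &src, needed, &state);
    return wide;
}
```

In the default "C" locale this only handles plain ASCII reliably; a real MBCS conversion would need the right locale (or, on Windows, the right code page argument).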

Waldermort

Yeah, I figured. I decided to use the ATL macros. But look at this snippet, I can't work it out.

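USES_CONVERSION;
// uncompressed text: 'length' starts out as a count of UTF-16 characters
length *= 2; // now the size of the text in bytes
char *text = new char [length+1];
memset(text,0,length+1);
// convert the UTF-16 data to code page 936
WideCharToMultiByte(936/* GB_2312 */,0,(unsigned short *)data,length/2,text,length+1,NULL,NULL);
// A2T converts the multibyte text back to wide in a Unicode build
strings[_added_strings] = new String( A2T(text) );
_added_strings++;
delete[] text;
data += length; // advance past the text just read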

Basically, I am reading from a byte array which contains UTF-16 text. Here I am converting it to multibyte, but for a Unicode build I want to leave it as it is. The problem is that if I treat the text as wchar_t and memcpy the bytes over, I get no text. The only way I can make it work is to convert to multibyte first and then convert it back to wide.
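One way to keep the text wide without the round trip through multibyte is to assemble the 16-bit code units from the byte array by hand. The following is only a sketch under stated assumptions: the bytes are little-endian UTF-16, the count is given in characters (as in the snippet above before `length *= 2`), and the function name `decode_utf16le` is hypothetical. It does not explain why the memcpy attempt produced no text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: build a UTF-16 string from raw little-endian bytes.
// Reading byte by byte avoids alignment problems and does not depend on
// the size of wchar_t on the target platform.
std::u16string decode_utf16le(const unsigned char* data, std::size_t char_count) {
    assert(data != nullptr || char_count == 0);
    std::u16string out;
    out.reserve(char_count);
    for (std::size_t i = 0; i < char_count; ++i) {
        const unsigned low = data[2 * i];       // low byte comes first
        const unsigned high = data[2 * i + 1];  // high byte second
        out.push_back(static_cast<char16_t>(low | (high << 8)));
    }
    return out;
}
```

For example, the bytes 0x60 0x4F 0x7D 0x59 decode to the two code units U+4F60 U+597D. On Windows, where wchar_t is 16 bits wide, the result can then be copied element by element into a wide string.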

Pierre Leclercq

IMO, the difference in length should be related to the string representation. I would guess the unicode string starts with the length of the string. So with a simple memcpy you might incorrectly copy the string.

Justin Tay

L"" is not a macro; it is simply the way to mark a Unicode string literal, i.e. whatever is within it has to be a valid Unicode string. If you are specifying an MBCS string literal, then in your Unicode build you have to do the conversion from MBCS (specifying the codepage the string is in) to Unicode.

You should also consider what happens if, in the MBCS build, the user is not running with the codepage that the MBCS string is in (MBCS strings are all codepage specific). Converting your MBCS string to Unicode and then back to the user's current codepage would probably fail (being unable to do the mappings), and using that MBCS string while ignoring the user's current codepage is very wrong and might produce invalid file path characters on that user's codepage.

If you are using VC7 or later, you should use the new ATL conversion classes instead, though there are some differences.

Mike Dimmick

The behaviour will depend on the codepage that the compiler uses to read the source file, which is the user's default codepage; for a Western developer, that will normally be Windows-1252. Newer versions of the compilers can, I think, recognise the Byte Order Mark to handle UTF-16 or UTF-8 source code files (the BOM is the character U+FEFF, so if the file begins with the bytes 0xFF 0xFE, it's likely to be a UTF-16 little-endian file, while if it starts 0xEF 0xBB 0xBF, it's probably a UTF-8 file). In VS2003 and 2005, you can go to File, Advanced Save Options and specify the encoding to use for a given source file.

You may get best results using either a UTF-8 or UTF-16 source file, if your environment supports it, or using the hex escapes to explicitly specify the characters. I'm not sure if Visual C++ supports the '\u' syntax for specifying Unicode characters directly; I don't think it does. Note that any Latin characters in your MBCS string will be encoded using one byte, not two: indeed, any character coloured white on, for example, the Simplified Chinese CP936 reference chart.

Finally, you're not explicitly encoding the translation of 'Program Files', are you? There are many locations for which you can, and should, discover the localized path using the SHGetSpecialFolderPath API.
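The Byte Order Mark check described in this post can be sketched as a small function. This is an illustration only, not how any particular compiler does it; the names `SourceEncoding` and `detect_bom` are made up, and the big-endian case (0xFE 0xFF) is an addition that the post does not mention.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for the sketch.
enum class SourceEncoding { Unknown, Utf8, Utf16LittleEndian, Utf16BigEndian };

// Guess a file's encoding from its first bytes by looking for an
// encoded U+FEFF. A match is a strong hint, not proof.
SourceEncoding detect_bom(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return SourceEncoding::Utf8;               // EF BB BF
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return SourceEncoding::Utf16LittleEndian;  // FF FE
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return SourceEncoding::Utf16BigEndian;     // FE FF (not in the post)
    }
    return SourceEncoding::Unknown;                // no BOM: codepage unknown
}
```

A file with no BOM falls through to `Unknown`, which is the case where the default codepage decides how the bytes in a string literal are interpreted.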

Stability. What an interesting concept. -- Chris Maunder

Waldermort

The file I am reading is a standard BIFF8 Excel file, and going by the docs for the file format I am reading it the correct way. As it happens, I don't need to worry too much about dealing with multibyte characters; I live and work in China, so I am surrounded by them on a daily basis. My problem is with Unicode: UTF-7, UTF-8, UTF-16, little endian, big endian, where does it end? Usually all my builds are MBCS to be compatible with Win95 systems (don't ask), but lately I have begun to realise the importance of Unicode. So it looks like I am going to have to start learning all over again.

Michael Dunn

#ifdef UNICODE
LPCWSTR str = L"\x4f60\x597d"; // ni hao
#else
LPCSTR str = "\x81\xf1\x8f\x11"; // NOTE: numbers just made up; this would be the MBCS equivalent of U+4F60 U+597D
#endif

You can't do it with one literal because the contents of the literal have to match the character set that you're compiling for.
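One thing to watch with hex escapes like the ones above: a `\x` escape swallows every hex digit that follows it, so ordinary text starting with a letter from a to f (or a digit) right after the escape becomes part of the number. A minimal sketch of the usual workaround, splitting the literal so that the escape ends where intended; the names `kGreeting` and `greeting_length` are made up for illustration:

```cpp
#include <cstddef>
#include <cwchar>

// Risky: in L"\x4F60abc" the escape would be read as \x4F60abc,
// because 'a', 'b' and 'c' are valid hex digits.
//
// Safe: adjacent string literals are concatenated only after escapes
// have been processed, so splitting the literal ends the escape.
const wchar_t kGreeting[] = L"\x4F60" L"abc";

// Length in wide characters: one for U+4F60 plus three for "abc".
std::size_t greeting_length() { return std::wcslen(kGreeting); }
```

The same applies to the narrow branch, where a byte escape followed by ordinary hex-looking text needs the same split.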

--Mike-- Visual C++ MVP :cool: LINKS~! Ericahist | PimpFish | CP SearchBar v3.0 | C++ Forum FAQ

Waldermort

So it's probably better to store all string literals in a string table, so that LoadString() can handle the mess. But having said that, the string table is converted to Unicode, so won't the same thing happen?

Michael Dunn

No, because when you call LoadStringA(), the OS will convert the string to MBCS for you. (All ANSI APIs on NT work this way.)
