Converting char* to unicode big-endian

scchan1984

Hi, I have a char*. How do I convert it to Unicode big-endian in unmanaged C++? Thanks.

GKarRacer

Converting to Unicode and converting to big-endian are two completely separate things. To convert to Unicode, use mbstowcs from the C runtime library or the Windows API call MultiByteToWideChar.

Do you really need it in big-endian? Are you sending the data over TCP or something? Anyway, to convert to big-endian, loop over each character in the Unicode string buffer and either call htons from the Winsock library (the 16-bit variant; htonl works on 32-bit values) or manually swap the bytes yourself, like:

   wchar_t bigendchar = ((littleendwchar & 0xFF) << 8) | ((littleendwchar & 0xFF00) >> 8);
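
A minimal sketch of that manual swap, written against fixed-width types so it does not depend on the size of wchar_t. The names swap16 and swap_all are made up for illustration and are not part of any library.

```cpp
#include <cstdint>
#include <string>

// Swap the two bytes of one 16-bit code unit (hypothetical helper).
inline std::uint16_t swap16(std::uint16_t v)
{
    return static_cast<std::uint16_t>(((v & 0x00FFu) << 8) | ((v & 0xFF00u) >> 8));
}

// Return a copy of a UTF-16 string with every code unit byte-swapped
// (hypothetical helper).
inline std::u16string swap_all(const std::u16string& in)
{
    std::u16string out;
    out.reserve(in.size());
    for (char16_t c : in)
        out.push_back(static_cast<char16_t>(swap16(static_cast<std::uint16_t>(c))));
    return out;
}
```

Swapping twice gives back the original string, which is a quick sanity check for this kind of helper.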

Nemanja Trifunovic

MultiByteToWideChar


scchan1984

Actually, I want to convert the string pointed to by a char* to big-endian encoding; it has nothing to do with TCP/IP. Something like the following managed C++ code, but working in unmanaged C++:

#include "stdafx.h"
#using <mscorlib.dll>

using namespace System;
using namespace System::Text;

int main()
{
   String* unicodeString = S"This string contains the unicode character Pi(中)";

   // Create two different encodings.
   Encoding * unicode = Encoding::Unicode;
   Encoding * bigendian = Encoding::BigEndianUnicode;

   // Convert the string into a Byte[].
   Byte unicodeBytes[] = unicode -> GetBytes(unicodeString);

   // Perform the conversion from one encoding to the other.
   Byte bigendianBytes[] = Encoding::Convert(unicode, bigendian, unicodeBytes);

   // Convert the new Byte[] into a Char[] and then into a string.
   // This is a slightly different approach to converting to illustrate
   // the use of GetCharCount/GetChars.
   Char bigendianChars[] = new Char[bigendian ->GetCharCount(bigendianBytes, 0, bigendianBytes -> Length)];
   bigendian -> GetChars(bigendianBytes, 0, bigendianBytes->Length, bigendianChars, 0);
   String* bigendianString = new String(bigendianChars);

   // Display the strings created before and after the conversion.
   Console::WriteLine(S"Original String*: {0}", unicodeString);
   Console::WriteLine(S"bigendian converted String*: {0}", bigendianString);
}

Jose Lamas Rios

That code isn't actually converting from char* (ANSI string) to big-endian. Rather, it's converting a String object (which internally holds a Unicode string encoded as UTF-16) to another String object, which internally will hold a Unicode string, again encoded as UTF-16 but this time using big-endian. From end to end it's just swapping pairs of bytes. The entire process is as follows:

1. Get the internal buffer of the original string as a byte array
2. Convert the byte array to big-endian (swapping each pair of bytes)
3. Copy the byte array into an array of Unicode UTF-16 chars
4. Create a new String object from the array of Unicode UTF-16 big-endian chars

So, an equivalent in unmanaged C++ would be converting a wide char string to big-endian. I'm not aware of any standard C++ or Win32 API function that can be used to swap pairs of bytes, but something like the function below should work:

// Note nBufLen is the count of wide chars, not a byte count.
// nBufLen represents the buffer size and must include space
// for the NULL terminator
void ConvertToBigEndian(const wchar_t* pw, wchar_t* pwBuffer, int nBufLen)
{
   int i = 0;
   for( ;i < nBufLen && pw[i] != 0; i++)
   {
      const char* p = (const char*)&pw[i];
      char* q = (char*) &pwBuffer[i];

      q[0] = p[1];
      q[1] = p[0];
   }

   // terminate destination string
   if (i < nBufLen)
      pwBuffer[i] = 0;
   else
      pwBuffer[nBufLen-1] = 0;
}

Besides that, if you really need to start from a char* (MBCS/ANSI string), first convert it to wide chars (Unicode UTF-16) using MultiByteToWideChar.

Hope that helps,
-- jlr
http://jlamas.blogspot.com/

scchan1984

Thanks, everybody. I have written a method:

char* UnicodeCharToBigEndianConverter(char* message)
{
   wchar_t input[2000];
   wchar_t output[2000];

   MultiByteToWideChar(
      CP_ACP,             // code page
      MB_COMPOSITE,       // character-type options
      message,            // string to map
      -1,                 // number of bytes in string
      input,              // wide-character buffer
      strlen(message)+2   // size of buffer
   );

   // escape character for unicode
   output[0]='\xFE\xFF';
   MessageBoxW(NULL,input,L"Input",0);

   for(int i=1;i<=wcslen(input);i++)
   {
      output[i] = ((input[i-1] & 0xFF) << 8) | ((input[i-1] & 0xFF00) >> 8);
   }

   wcscat(output, (wchar_t*)"\x00\x00");
   return (char*) output;
}

It works fine if the char* that I get from my application is a string made up entirely of non-ASCII characters, e.g. all Chinese characters. However, what if a user inputs a string with some Chinese characters and some ASCII characters, e.g. A-Z, a-z? Can the whole string be converted to big-endian using:

   output[i] = ((input[i-1] & 0xFF) << 8) | ((input[i-1] & 0xFF00) >> 8);

Or do we need to handle those ASCII characters separately?

Jose Lamas Rios

I hope you don't mind my comments :)

scchan1984 wrote:
   wchar_t input[2000];
   wchar_t output[2000];
   MultiByteToWideChar(
      CP_ACP,             // code page
      MB_COMPOSITE,       // character-type options
      message,            // string to map
      -1,                 // number of bytes in string
      input,              // wide-character buffer
      strlen(message)+2   // size of buffer
   );

Those magic 2000s don't look good. If you are going to specify a maximum length for the strings you support, you should at least check that the string you receive isn't longer than that.

The last parameter to MultiByteToWideChar is wrong. There you are expected to pass the size of your buffer, which in this case is 2000. strlen() returns the char count of the string (not counting the null terminator), so, for example, if you receive a message of 2100 Chinese characters, strlen will return 4200. MultiByteToWideChar will then try to write 2100 wide chars into a buffer that can only hold 2000, based on the erroneous information you gave it (you would be telling it that your buffer has space for 4202 wide chars instead of the actual 2000). The most likely result is a crash of your application.

You should call MultiByteToWideChar first using 0 as the buffer size. That will return the required size for the buffer in wide chars. With that info, you can allocate a buffer on the heap, and then call MultiByteToWideChar again to do the conversion.

scchan1984 wrote:
   // escape character for unicode
   output[0]='\xFE\xFF';

You defined both buffers with the same size, but if output will hold a byte order mark, then it should be at least one wide char bigger, or you might not have enough space.

scchan1984 wrote:
   for(int i=1;i<=wcslen(input);i++)

You are calling wcslen in each iteration, making it traverse the entire string in search of the null terminator. It would make more sense to call wcslen outside of the loop, store its value in a variable, and use the variable in the loop. Then again, you don't even need to call wcslen, as the length of input is what MultiByteToWideChar returns (including the null terminator when you pass -1 as the input length); you'd just need to receive it in a variable.

scchan1984 wrote:
   output[i] = ((input[i-1] & 0xFF) << 8) | ((input[i-1] & 0xFF00) >> 8);

This seems to be a more readable version of the byte swap I wrote in my previous post, right? :)

scchan1984 wrote:
   wcscat(output, (wchar_t*)"\x00\x00");

Here you are again traversing the entire string
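
The measure-first, convert-second pattern recommended above can be sketched portably with mbstowcs from the C runtime, which was mentioned earlier in the thread. This is not MultiByteToWideChar itself: it assumes the POSIX/MSVC behaviour where a null destination makes mbstowcs return the required length, it uses the current C locale rather than a code page argument, and to_wide is a hypothetical name.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: convert a multibyte string (current C locale) to a
// wide string, sizing the buffer from a first "measure only" call.
inline std::wstring to_wide(const char* src)
{
    // First call: null destination, so only the required length is returned
    // (in wide characters, not counting the terminator).
    std::size_t needed = std::mbstowcs(nullptr, src, 0);
    if (needed == static_cast<std::size_t>(-1))
        return std::wstring();   // invalid multibyte sequence

    // Second call: the buffer is now known to be large enough.
    std::vector<wchar_t> buf(needed + 1);
    std::mbstowcs(buf.data(), src, buf.size());
    return std::wstring(buf.data(), needed);
}
```

Because the buffer is sized from the first call, there is no fixed limit such as 2000 to check against.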

scchan1984

I got the answer!!! The reason is that when I convert an ASCII character to big-endian, say 'B', whose Unicode bytes are \x42\x00, it becomes \x00\x42. I was using strlen to check the length of the string returned by char* UnicodeCharToBigEndianConverter(char* message); strlen stops at the first \x00 byte and so gives me the wrong string length!! My code in the above post should be OK. Anyone interested is welcome to have a look at it and test it.
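
A small sketch of that effect, assuming a byte buffer holding "BC" in UTF-16 big-endian. The names kBigEndianBC and utf16_units are made up for illustration.

```cpp
#include <cstddef>
#include <cstring>

// "BC" in UTF-16 big-endian, followed by a two-byte terminator.
static const char kBigEndianBC[] = { 0x00, 0x42, 0x00, 0x43, 0x00, 0x00 };

// Count UTF-16 code units by looking for a 00 00 pair at an even offset,
// instead of stopping at the first zero byte (hypothetical helper).
inline std::size_t utf16_units(const char* bytes)
{
    std::size_t n = 0;
    while (bytes[2 * n] != 0 || bytes[2 * n + 1] != 0)
        ++n;
    return n;
}
```

Here std::strlen(kBigEndianBC) sees the leading zero byte and reports 0, while utf16_units(kBigEndianBC) reports 2.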

scchan1984

Really really thank you so much. I have to improve and test my code now :D

scchan1984

My improved version:

WCHAR* UnicodeCharToBigEndianConverter(char* message)
{
   // get the number of WCHAR from the message
   int nInputLen = MultiByteToWideChar(
      CP_ACP,        // code page
      0,             // character-type options
      message,       // string to map
      -1,            // number of bytes in string
      NULL,          // wide-character buffer
      NULL           // size of buffer
   );

   WCHAR* input = new WCHAR[nInputLen+1];
   WCHAR* output = new WCHAR[nInputLen+1];

   MultiByteToWideChar(
      CP_ACP,        // code page
      MB_COMPOSITE,  // character-type options
      message,       // string to map
      -1,            // number of bytes in string
      input,         // wide-character buffer
      nInputLen+1    // size of buffer
   );

   // escape character for unicode
   output[0]='\xFE\xFF';

   for(int i=1;i<=nInputLen;i++)
   {
      output[i] = ((input[i-1] & 0xFF) << 8) | ((input[i-1] & 0xFF00) >> 8);
      //output[i] = input[i-1];
   }

   output[nInputLen+1] = 0;
   return output;
}
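
For comparison, a minimal sketch that writes the big-endian bytes explicitly instead of swapping wide chars in place, so the result does not depend on the byte order of the machine. It assumes the text is already UTF-16 in a std::u16string; to_utf16be_bytes is a hypothetical name. The big-endian byte order mark is the byte sequence FE FF.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: serialize UTF-16 code units as big-endian bytes,
// preceded by the byte order mark FE FF.
inline std::vector<unsigned char> to_utf16be_bytes(const std::u16string& text)
{
    std::vector<unsigned char> bytes;
    bytes.reserve(2 * text.size() + 2);
    bytes.push_back(0xFE);
    bytes.push_back(0xFF);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<unsigned char>(unit >> 8));    // high byte first
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF));  // then low byte
    }
    return bytes;
}
```

The vector owns its memory and carries its own length, so zero bytes inside the data are not mistaken for a terminator.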

Jose Lamas Rios

It's much better now :)

Note that you are allocating two buffers. You are returning one of them as the result. The other should be deleted before returning, to avoid leaking memory. Add the following line before the return:

   delete[] input;

You should note that the callers of this function are responsible for deleting the result... Besides that, it seems to be already doing what it's supposed to do. Now, if you want to do it really right, keep reading...

First, I recommend you read the following article by Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Now that you have read it (if not, go and read it, I'll wait right here), you surely understand why a name like UnicodeCharToBigEndian[...] doesn't make much sense for what this function does (i.e., it's not receiving a Unicode string, but a multibyte string). A better name would be MultiByteToBigEndianWideChar...

One additional problem with your function is that it returns a buffer that must be released by the caller. This is a problematic approach because, as a user of the function, I couldn't deduce that from the function signature alone. If all I see is a function declared in a header file as

   WCHAR* MultiByteToBigEndianWideChar(char* message);

I have no way to know whether I should delete the returned buffer or not, or, in case I'm expected to delete it, whether I should use delete[], free, or anything else. I'd have to rely on some documentation or on having access to the implementation itself in order to find out.

To avoid this ambiguity, it's common practice in this kind of function to make the caller supply the buffer and its size as parameters. When the buffer is supplied by the caller, the caller already knows whether that buffer is allocated on the stack or on the heap, and how to release it in the latter case. That's exactly the approach followed by MultiByteToWideChar and all the API functions. In fact, you might start with a function with exactly the same parameters as MultiByteToWideChar:

int MultiByteToBigEndianWideChar(
  UINT CodePage, // code page
  DWORD dwFlags, // character-type options
  LPCSTR lpMultiByteStr, // string to map
  int cbM