Converting char* to unicode big-endian

    scchan1984 wrote (#1):

    Hi, I have a char*. How do I convert it to Unicode big-endian in unmanaged C++? Thanks

      GKarRacer wrote (#2, in reply to #1):

      Converting to Unicode and converting to big-endian are two completely separate things. To convert to Unicode, use mbstowcs from the C runtime library or the Windows API call MultiByteToWideChar. Do you really need it in big-endian? Are you sending the data over TCP or something? Anyway, to convert to big-endian, loop over each character in the Unicode string buffer and either call htons from the Winsock library (htonl swaps 32-bit values, so it is the wrong size for a 16-bit wchar_t) or manually swap the bytes yourself, like:

      wchar_t bigendchar = ((littleendwchar & 0xFF) << 8) | ((littleendwchar & 0xFF00) >> 8);
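The "loop over each character and swap the bytes" step can be sketched portably as below. This is a minimal illustration, not code from the thread: the names swap16 and swap_buffer are made up here, and std::uint16_t stands in for the 16-bit wchar_t found on Windows.

```cpp
#include <cstddef>
#include <cstdint>

// Swap the two bytes of one 16-bit code unit.
constexpr std::uint16_t swap16(std::uint16_t u) {
    return static_cast<std::uint16_t>(((u << 8) & 0xFF00) | (u >> 8));
}

// Byte-swap every code unit in a buffer of `count` units, in place.
inline void swap_buffer(std::uint16_t* units, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        units[i] = swap16(units[i]);
    }
}
```

Swapping only produces big-endian data when the host stores integers little-endian, which is the case on the x86 Windows machines this thread is about.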

        Nemanja Trifunovic wrote (#3, in reply to #1):

        MultiByteToWideChar


        My programming blahblahblah blog. If you ever find anything useful here, please let me know to remove it.

          scchan1984 wrote (#4, in reply to #2):

          Actually, I want to convert the string pointed to by a char* to big-endian encoding; it has nothing to do with TCP/IP. Something like the following managed C++ code, but working in unmanaged C++:

          #include "stdafx.h"
          #using <mscorlib.dll>
          using namespace System;
          using namespace System::Text;

          int main()
          {
             String* unicodeString = S"This string contains the unicode character Pi(中)";

             // Create two different encodings.
             Encoding * unicode = Encoding::Unicode;
             Encoding * bigendian = Encoding::BigEndianUnicode;

             // Convert the string into a Byte[].
             Byte unicodeBytes[] = unicode -> GetBytes(unicodeString);

             // Perform the conversion from one encoding to the other.
             Byte bigendianBytes[] = Encoding::Convert(unicode, bigendian, unicodeBytes);

             // Convert the new Byte[] into a Char[] and then into a string.
             // This is a slightly different approach to converting to illustrate
             // the use of GetCharCount/GetChars.
             Char bigendianChars[] = new Char[bigendian ->GetCharCount(bigendianBytes, 0, bigendianBytes -> Length)];
             bigendian -> GetChars(bigendianBytes, 0, bigendianBytes->Length, bigendianChars, 0);
             String* bigendianString = new String(bigendianChars);

             // Display the strings created before and after the conversion.
             Console::WriteLine(S"Original String*: {0}", unicodeString);
             Console::WriteLine(S"bigendian converted String*: {0}", bigendianString);
          }

            Jose Lamas Rios wrote (#5, in reply to #4):

            That code isn't actually converting from a char* (ANSI string) to big-endian. Rather, it's converting a String object, which internally holds a Unicode string encoded as UTF-16, to another String object, which internally holds a Unicode string again encoded as UTF-16, but this time big-endian. From end to end it's just swapping pairs of bytes. The entire process is as follows:

            1. Get the internal buffer of the original string as a byte array
            2. Convert the byte array to big-endian (swapping each pair of bytes)
            3. Copy the byte array into an array of Unicode UTF-16 chars
            4. Create a new String object from the array of Unicode UTF-16 big-endian chars

            So, an equivalent in unmanaged C++ would be converting a wide char string to big-endian. I'm not aware of any standard C++ or Win32 API function that can be used to swap pairs of bytes, but something like the function below should work:

            // Note: nBufLen is the count of wide chars, not a byte count.
            // nBufLen represents the buffer size and must include space
            // for the NULL terminator.
            // Assumes wchar_t is 2 bytes wide, as it is on Windows.
            void ConvertToBigEndian(const wchar_t* pw, wchar_t* pwBuffer, int nBufLen)
            {
               if (nBufLen <= 0)
                  return; // no room even for the terminator

               int i = 0;
               for( ;i < nBufLen && pw[i] != 0; i++)
               {
                  // view the source and destination wide chars as raw bytes
                  const char* p = (const char*)&pw[i];
                  char* q = (char*) &pwBuffer[i];

                  // swap the two bytes
                  q[0] = p[1];
                  q[1] = p[0];
               }

               // terminate destination string, truncating if the buffer is full
               if (i < nBufLen)
                  pwBuffer[i] = 0;
               else
                  pwBuffer[nBufLen-1] = 0;
            }

            Besides that, if you really need to start from a char* (MBCS/ANSI string), first convert it to wide chars (Unicode UTF-16) using MultiByteToWideChar.

            Hope that helps,
            -- jlr
            http://jlamas.blogspot.com/
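ConvertToBigEndian relies on the host being little-endian and on wchar_t being two bytes wide. If the goal is a big-endian byte stream rather than swapped wide chars, a sketch that does not depend on either assumption is to emit the high byte and then the low byte of each code unit. The function name to_utf16be_bytes is made up for this illustration, and std::u16string stands in for the wide string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Serialize UTF-16 code units as big-endian bytes: high byte first, then low.
// Shifts work on values, not memory layout, so the result is the same on any host.
inline std::vector<unsigned char> to_utf16be_bytes(const std::u16string& text) {
    std::vector<unsigned char> bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<unsigned char>((unit >> 8) & 0xFF));
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF));
    }
    return bytes;
}
```

The result carries its own length, so nothing downstream has to search for a terminator.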

              scchan1984 wrote (#6, in reply to #5):

              Thanks, everybody. I have written a function:

              char* UnicodeCharToBigEndianConverter(char* message)
              {
                 wchar_t input[2000];
                 wchar_t output[2000];
                 MultiByteToWideChar(
                    CP_ACP,            // code page
                    MB_COMPOSITE,      // character-type options
                    message,           // string to map
                    -1,                // number of bytes in string
                    input,             // wide-character buffer
                    strlen(message)+2  // size of buffer
                 );

                 // escape character for unicode
                 output[0]='\xFE\xFF';
                 MessageBoxW(NULL,input,L"Input",0);

                 for(int i=1;i<=wcslen(input);i++)
                 {
                    output[i] = ((input[i-1] & 0xFF) << 8) | ((input[i-1] & 0xFF00) >> 8);
                 }
                 wcscat(output, (wchar_t*)"\x00\x00");
                 return (char*) output;
              }

              It works fine if the char* that I get from my application is a string made up entirely of non-ASCII characters, e.g. all Chinese characters. However, what if a user inputs a string with some Chinese characters and some ASCII characters, e.g. A-Z, a-z? Can the whole string be converted to big-endian using:

              output[i] = ((input[i-1] & 0xFF) << 8) | ((input[i-1] & 0xFF00) >> 8);

              Or do we need to handle the ASCII characters separately?

                Jose Lamas Rios wrote (#7, in reply to #6):

                I hope you don't mind my comments :)

                scchan1984 wrote:
                   wchar_t input[2000];
                   wchar_t output[2000];
                   MultiByteToWideChar(
                      CP_ACP,            // code page
                      MB_COMPOSITE,      // character-type options
                      message,           // string to map
                      -1,                // number of bytes in string
                      input,             // wide-character buffer
                      strlen(message)+2  // size of buffer
                   );

                Those magic 2000s don't look good. If you are going to specify a maximum length for the strings you support, you should at least check that the string you receive isn't longer than that.

                The last parameter to MultiByteToWideChar is wrong. There you are expected to pass the size of your buffer, which in this case is 2000. strlen() returns the char count of the string (not counting the null terminator), so, for example, if you receive a message of 2100 Chinese characters, strlen will return 4200. MultiByteToWideChar will then try to write 2100 wide chars into a buffer that can only hold 2000, based on the erroneous information you gave it (you would be telling it that your buffer has space for 4202 wide chars instead of the actual 2000). The most likely result is a crash of your application.

                You should call MultiByteToWideChar first using 0 as the buffer size. That will return the required size for the buffer in wide chars. With that info, you can allocate a buffer on the heap, and then call MultiByteToWideChar again to do the conversion.

                scchan1984 wrote:
                   // escape character for unicode
                   output[0]='\xFE\xFF';

                You defined both buffers with the same size, but if output will hold a byte order marker, then it should be at least one wide char bigger, or you might not have enough space.

                scchan1984 wrote:
                   for(int i=1;i<=wcslen(input);i++)

                You are calling wcslen in each iteration, making it traverse the entire string in search of the NULL terminator. It would make more sense to call wcslen outside of the loop, store its value in a variable, and use the variable in the loop. Then again, you don't even need to call wcslen, as the length of input is what MultiByteToWideChar returns; you'd just need to store it in a variable.

                scchan1984 wrote:
                   output[i] = ((input[i-1] & 0xFF) << 8) | ((input[i-1] & 0xFF00) >> 8);

                This seems to be a more readable version of the byte swap I wrote in my previous post, right? :)

                scchan1984 wrote:
                   wcscat(output, (wchar_t*)"\x00\x00");

                Here you are again traversing the entire string
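The "ask for the size first, then convert" pattern can be tried without the Windows headers. The sketch below uses the standard std::mbsrtowcs as a stand-in for MultiByteToWideChar (an assumption made for portability, not what the thread uses), and the helper name widen is made up. A null destination makes the first call only count the wide characters.

```cpp
#include <cwchar>
#include <string>

// Convert a multibyte string (current C locale) to wide characters.
// The first call sizes the result; the second converts into a buffer of that size.
inline std::wstring widen(const char* message) {
    std::mbstate_t state = std::mbstate_t();
    const char* src = message;
    // Null destination: count the wide characters only (terminator excluded).
    std::size_t needed = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (needed == static_cast<std::size_t>(-1)) {
        return std::wstring();  // invalid multibyte sequence
    }
    std::wstring wide(needed, L'\0');
    state = std::mbstate_t();
    src = message;
    std::mbsrtowcs(&wide[0], &src, needed, &state);
    return wide;
}
```

The length is known from the first call and kept in the string itself, so a later loop can use wide.size() instead of calling wcslen on every iteration.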

                  scchan1984 wrote (#8, in reply to #6):

                  I got the answer!!! The reason is that when I convert an ASCII character to big-endian, for example 'B', which is \x42\x00 in Unicode, it becomes \x00\x42. I was using strlen to check the length of the string returned by UnicodeCharToBigEndianConverter(char* message); strlen stops at the first \x00 byte and so gave me the wrong length!! My code in the above post should be OK. Anyone interested is welcome to have a look at it and test it.
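The strlen problem described here is easy to reproduce. In this small sketch the sample buffer and both function names are made up for illustration: a byte-oriented length stops at the zero high byte of 'B', while a length that walks two bytes at a time finds the real 00 00 terminator.

```cpp
#include <cstddef>
#include <cstring>

// "B中" as UTF-16 big-endian bytes, followed by a two-byte terminator.
static const unsigned char kSample[] = {0x00, 0x42, 0x4E, 0x2D, 0x00, 0x00};

// Count 16-bit code units up to a 00 00 terminator in a raw byte buffer.
inline std::size_t utf16_length(const unsigned char* bytes) {
    std::size_t units = 0;
    while (bytes[2 * units] != 0 || bytes[2 * units + 1] != 0) {
        ++units;
    }
    return units;
}

// strlen treats the buffer as single bytes and stops at the first zero byte,
// which here is the high byte of 'B'.
inline std::size_t naive_length() {
    return std::strlen(reinterpret_cast<const char*>(kSample));
}
```

Any API that takes a char* and looks for a single zero byte will misread such a buffer in the same way, so the length is better passed alongside it.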

                    scchan1984 wrote (#9, in reply to #7):

                    Thank you so much, really. I have to improve and test my code now :D

                      scchan1984 wrote (#10, in reply to #9):

                      My improved version:

                      WCHAR* UnicodeCharToBigEndianConverter(char* message)
                      {
                         // get the number of WCHARs needed for the message; because -1 is
                         // passed as the input length, the count includes the NULL terminator
                         int nInputLen = MultiByteToWideChar(
                            CP_ACP,        // code page
                            MB_COMPOSITE,  // character-type options (same as the conversion call)
                            message,       // string to map
                            -1,            // number of bytes in string
                            NULL,          // wide-character buffer
                            0              // size of buffer (0 = only report the required size)
                         );

                         WCHAR* input = new WCHAR[nInputLen+1];
                         WCHAR* output = new WCHAR[nInputLen+2]; // BOM + converted chars + final terminator

                         MultiByteToWideChar(
                            CP_ACP,        // code page
                            MB_COMPOSITE,  // character-type options
                            message,       // string to map
                            -1,            // number of bytes in string
                            input,         // wide-character buffer
                            nInputLen+1    // size of buffer
                         );

                         // byte order mark
                         output[0]='\xFE\xFF';

                         for(int i=1;i<=nInputLen;i++)
                         {
                            output[i] = ((input[i-1] & 0xFF) << 8) | ((input[i-1] & 0xFF00) >> 8);
                            //output[i] = input[i-1];
                         }
                         output[nInputLen+1] = 0;
                         return output;
                      }
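One detail in this version is worth a hedge: '\xFE\xFF' is a multi-character literal, and its value is implementation-defined, so the bytes that end up in output[0] depend on the compiler. A sketch of an alternative is to treat the byte order mark as the ordinary code unit U+FEFF and swap it together with the text. The helper name with_bom_swapped is made up, char16_t stands in for WCHAR, and the byte layout described in the comment assumes a little-endian host.

```cpp
#include <string>

// Prepend U+FEFF, then byte-swap every code unit, the mark included.
// On a little-endian host the result's memory reads FE FF followed by
// the text in big-endian order.
inline std::u16string with_bom_swapped(const std::u16string& text) {
    std::u16string out;
    out.reserve(text.size() + 1);
    out.push_back(static_cast<char16_t>(0xFEFF));  // byte order mark
    out += text;
    for (char16_t& unit : out) {
        unit = static_cast<char16_t>(((unit << 8) & 0xFF00) | (unit >> 8));
    }
    return out;
}
```

Returning the string by value also means there is no second buffer to delete and no question about who frees the result.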

                        Jose Lamas Rios wrote (#11, in reply to #10):

                        It's much better now :)

                        Note that you are allocating two buffers. You are returning one of them as the result. The other should be deleted before returning, to avoid leaking memory. Add the following line before the return:

                        delete[] input;

                        You should note that the callers of this function are responsible for deleting the result...

                        Besides that, it seems to be already doing what it's supposed to do. Now, if you want to do it really right, keep reading...

                        First, I recommend you read the following article by Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

                        Now that you have read it (if not, go and read it, I'll wait right here), you surely understand why a name like UnicodeCharToBigEndian[...] doesn't make much sense for what this function does (i.e., it's not receiving a Unicode string, but a multibyte string). A better name would be MultiByteToBigEndianWideChar...

                        One additional problem with your function is that it returns a buffer that must be released by the caller. This is a problematic approach because, as a user of the function, I couldn't deduce that from the function signature alone. If all I see is a function declared in a header file as

                        WCHAR* MultiByteToBigEndianWideChar(char* message);

                        I have no way to know whether I should delete the returned buffer or not, or, in case I'm expected to delete it, whether I should use delete[], free, or anything else. I'd have to rely on some documentation or on having access to the implementation itself in order to find out.

                        To avoid this ambiguity, it's common practice in this kind of function to make the caller supply the buffer and its size as parameters. When the buffer is supplied by the caller, the caller already knows whether that buffer is allocated on the stack or on the heap, and how to release it in the latter case. That's exactly the approach followed by MultiByteToWideChar and all the API functions. In fact, you might start with a function with exactly the same parameters as MultiByteToWideChar:

                        int MultiByteToBigEndianWideChar(
                          UINT CodePage, // code page
                          DWORD dwFlags, // character-type options
                          LPCSTR lpMultiByteStr, // string to map
                          int cbM
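The caller-supplied-buffer convention described in this post can be sketched in portable C++. The signature below is hypothetical and is not the one the post goes on to define: SwapToBigEndian takes already-widened text as char16_t, reports the required size when the capacity is 0, and never allocates.

```cpp
#include <cstddef>

// Returns the number of code units required (terminator included) when
// `capacity` is 0, the number written on success, or 0 if `dst` is too small.
inline std::size_t SwapToBigEndian(const char16_t* src, char16_t* dst,
                                   std::size_t capacity) {
    std::size_t needed = 0;
    while (src[needed] != 0) {
        ++needed;
    }
    ++needed;  // room for the terminator
    if (capacity == 0) {
        return needed;  // size query: dst is not touched
    }
    if (capacity < needed) {
        return 0;  // caller's buffer is too small
    }
    for (std::size_t i = 0; i < needed; ++i) {
        // the terminator is copied too; swapping 0 leaves it 0
        dst[i] = static_cast<char16_t>(((src[i] << 8) & 0xFF00) | (src[i] >> 8));
    }
    return needed;
}
```

A caller would typically call it once with a capacity of 0, allocate that many code units wherever it likes (stack, heap, or a container), and call it again, which keeps ownership of the buffer entirely on the caller's side.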
