3-byte Unicode characters?

Anthony Appleyard

Windows functions that take character arguments currently each come in two versions: one for 1-byte characters (the old ASCII characters from 0 to 255), and one for 2-byte characters (WCHAR etc.) for the Unicode characters from 0 to 65535 (0x0000 to 0xFFFF). But there are Unicode characters defined that need 3 bytes, e.g. 0x012000 to 0x01236E for cuneiform; I have already found a Wikipedia page that displays cuneiform characters, or would if I had a font for cuneiform. How do Windows C++ programs usually handle, read and write such exotica? Wikipedia page for "Cuneiform"

Lost User

Those characters fall into the Multi-byte Character Set (MBCS) types, and require fonts that can display them.

Anthony Appleyard

Are there any C++ functions to handle MBCS characters?

Lost User

What specifically are you trying to do?

jschell

First, you figure out what you want to do. Second, you figure out what character set or character sets (plural) you need to solve your problem. Third, you determine what technology you need to solve that problem. At best, you haven't identified the second part of the above. At the least, you haven't stated what character set you think you will be working with.

Anthony Appleyard

The printf function has a version that prints one-byte characters, and a version that prints two-byte characters. Similarly with many other Windows C++ functions. But if I want to print a cuneiform character to screen, which is a 3-byte Unicode character, am I advised to stick to one-byte-character mode, build for myself the byte sequence that puts Unicode into the mode for 3-byte characters, and then send the 3-byte character as three one-byte characters?

jschell

Anthony Appleyard wrote:

that is a 3-byte Unicode character

You are confusing the conceptual with the practical. "Unicode" in its broadest sense is an attempt to regularize how characters are used in computing. It does that by defining characters. Those characters are then represented in character sets. There are quite a few of those (although fewer than the number of sets that existed before the standardization of Unicode). Following are two examples of character sets: http://en.wikipedia.org/wiki/UTF-8 and http://en.wikipedia.org/wiki/UTF-16. And those only define what is supposed to be in the data; they say nothing about whether any given technology X will support them partially, much less fully. You seem to be suggesting that you might be attempting to use UTF16. However, I am rather certain that there are variants of that.
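To make the difference between those two encodings concrete, here is a minimal C++ sketch (not from the thread) that counts how many code units the compiler emits for one cuneiform character, U+12000, in UTF-8, UTF-16 and UTF-32 string literals. The helper `code_units` and the constant names are hypothetical, chosen only for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating NUL.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// The same character, CUNEIFORM SIGN A (U+12000), in three encoding forms.
constexpr std::size_t kUtf8Units  = code_units(u8"\U00012000"); // 8-bit units
constexpr std::size_t kUtf16Units = code_units(u"\U00012000");  // 16-bit units
constexpr std::size_t kUtf32Units = code_units(U"\U00012000");  // 32-bit units
```

Under these assumptions the character takes four 8-bit units in UTF-8, two 16-bit units in UTF-16 and one 32-bit unit in UTF-32. In every case that is 4 bytes of storage; none of the three forms stores it in 3 bytes.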

Anthony Appleyard

jschell wrote:

You seem to be suggesting that you might be attempting to use UTF16. However I am rather certain that there are variants of that.

I have successfully read and printed and displayed on screen the 2-byte Unicode characters, in a C++ application called Typecase which I wrote, which is somewhat like Windows Character Map; it outputs by putting its text output in the clipboard. I have successfully output Unicode text to UTF16 mode files.

jschell

As I stated, there are variants of UTF16. I am rather certain that one does not have any extensions at all. Another uses two bytes (taken from a reserved range of two-byte values) to specify that the following two bytes are used together with them (4 bytes in total) to create a code point. I believe there is a variant of UTF8 that can have a 3-byte character code point. But I am not as sure that there is a UTF16 that does.
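The two-plus-two-byte scheme described above is what UTF-16 calls a surrogate pair. As a rough sketch of the arithmetic (not code from anyone in the thread), the following C++ splits a code point above 0xFFFF into its two 16-bit units. `Utf16Units` and `encode_utf16` are hypothetical names, and the function assumes its input is a valid code point.

```cpp
// Hypothetical helper type: the one or two UTF-16 code units for one code point.
struct Utf16Units {
    char16_t unit[2];
    int count;  // 1 for U+0000..U+FFFF, 2 for a surrogate pair
};

// Encode one Unicode code point as UTF-16.
// Assumes cp is valid (<= 0x10FFFF and not itself a surrogate); no checks are made.
constexpr Utf16Units encode_utf16(char32_t cp) {
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit.
        return Utf16Units{{static_cast<char16_t>(cp), 0}, 1};
    }
    // Subtract 0x10000, leaving a 20-bit value, then split it 10 bits / 10 bits.
    const char32_t v = cp - 0x10000;
    return Utf16Units{{static_cast<char16_t>(0xD800 + (v >> 10)),     // high surrogate
                       static_cast<char16_t>(0xDC00 + (v & 0x3FF))},  // low surrogate
                      2};
}
```

For the first cuneiform code point, `encode_utf16(0x12000)` gives the pair 0xD808, 0xDC00, which is what a two-byte-character (WCHAR) Windows string would hold for it.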

Joe Woodbury

You are confusing UTF-8 and UTF-16. Both use a variable-length representation for characters, though with UTF-16, common languages are represented with two bytes, with things like musical notes and cuneiform in the escaped range. See the following: https://en.wikipedia.org/wiki/UTF-16
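For comparison with UTF-16, here is a small hand-rolled UTF-8 encoder, offered as an illustrative sketch only. It shows where the 3-byte case actually comes from: in UTF-8, 3 bytes cover U+0800 to U+FFFF, while cuneiform needs 4. `Utf8Bytes` and `encode_utf8` are hypothetical names, and the function assumes a valid code point.

```cpp
// Hypothetical helper type: the 1 to 4 bytes UTF-8 uses for one code point.
struct Utf8Bytes {
    unsigned char byte[4];
    int count;
};

// Encode one Unicode code point as UTF-8.
// Assumes cp is valid (<= 0x10FFFF and not a surrogate); no checks are made.
constexpr Utf8Bytes encode_utf8(char32_t cp) {
    if (cp < 0x80) {      // 1 byte: ASCII
        return Utf8Bytes{{static_cast<unsigned char>(cp), 0, 0, 0}, 1};
    }
    if (cp < 0x800) {     // 2 bytes
        return Utf8Bytes{{static_cast<unsigned char>(0xC0 | (cp >> 6)),
                          static_cast<unsigned char>(0x80 | (cp & 0x3F)),
                          0, 0}, 2};
    }
    if (cp < 0x10000) {   // 3 bytes: the rest of U+0000..U+FFFF
        return Utf8Bytes{{static_cast<unsigned char>(0xE0 | (cp >> 12)),
                          static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)),
                          static_cast<unsigned char>(0x80 | (cp & 0x3F)),
                          0}, 3};
    }
    // 4 bytes: everything above U+FFFF, including cuneiform.
    return Utf8Bytes{{static_cast<unsigned char>(0xF0 | (cp >> 18)),
                      static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)),
                      static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)),
                      static_cast<unsigned char>(0x80 | (cp & 0x3F))}, 4};
}
```

With this sketch, the euro sign U+20AC comes out as the three bytes E2 82 AC, and the cuneiform code point U+12000 as the four bytes F0 92 80 80.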