wchar_t in C

jschell

Jochen Arndt wrote:

wchar_t are used to store fixed length Unicode characters like UTF-16 or UTF-32

Just noting that this statement is somewhat of a generalization. For starters, Unicode at those bit sizes never represents all characters via a single character. One needs to go to 128 bits for a full representation, and maybe even that isn't big enough. Additionally, wchar_t is not limited to Unicode, although these days that is perhaps the predominant usage in the Western world.

Jochen Arndt

Quote:

"wchar_t are used to store fixed length Unicode characters like UTF-16 or UTF-32" Just noting that statement is somewhat of a generalization.

Why? As far as I know wchar_t is not used for variable length encodings.

jschell wrote:

for those bit sizes never represents all characters via a single character

While that is true for UTF-16 it is not for UTF-32.

jschell wrote:

One needs to go to 128 bits for a full representation

That is wrong.

The Unicode Blog: Announcing The Unicode® Standard, Version 10.0:

Tuesday, June 20, 2017 Version 10.0 of the Unicode Standard is now available. For the first time, both the core specification and the data files are available on the same date. Version 10.0 adds 8,518 characters, for a total of 136,690 characters. These additions include four new scripts, for a total of 139 scripts, as well as 56 new emoji characters.

Still 14 of 32 bits unused.
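As a quick check of that arithmetic, here is a minimal C sketch. The helper name bits_needed is made up for illustration. It computes how many bits are needed to give each of a given number of characters its own value: the 136,690 characters of Unicode 10.0 fit in 18 bits, which leaves 14 of 32 unused. Even the whole Unicode code space (U+0000 to U+10FFFF, 1,114,112 code points) fits in 21 bits.

```c
#include <stdint.h>

/* Hypothetical helper: smallest number of bits b such that 2^b >= count,
   i.e. enough bits to give each of `count` items a distinct value. */
unsigned bits_needed(uint32_t count)
{
    unsigned bits = 0;
    /* The bits < 32 guard keeps the shift within the width of uint32_t. */
    while (bits < 32 && ((uint32_t)1 << bits) < count)
        bits++;
    return bits;
}
```

Calling bits_needed(136690) returns 18, and bits_needed(0x110000) returns 21.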

Anonygeeker

"fixed length Unicode characters like UTF-16 or UTF-32": a detailed explanation would be helpful. Are there any examples of them?

Jochen Arndt

wchar_t is used to store "wide characters" (characters using an encoding that requires more than one byte). The most commonly used character encodings for wchar_t are UCS-2 (a subset of UTF-16) and UTF-32. See Unicode - Wikipedia.
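Since an example was asked for, here is a minimal sketch, assuming a hosted C99 compiler. The helper name wide_bytes is hypothetical. Keep in mind that the size of wchar_t is implementation-defined: it is commonly 2 bytes on Windows (16-bit code units) and 4 bytes on Linux with glibc (32-bit code units).

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: storage in bytes used by a wide string,
   not counting the terminating L'\0'. */
size_t wide_bytes(const wchar_t *s)
{
    /* wcslen counts wchar_t elements, not bytes. */
    return wcslen(s) * sizeof(wchar_t);
}
```

For example, wide_bytes(L"abc") is 3 * sizeof(wchar_t), so the same three-character string takes 6 bytes where wchar_t is 16 bits wide and 12 bytes where it is 32 bits wide.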

Anonygeeker

Thanks.

jschell

Jochen Arndt wrote:

Why? As far as I know wchar_t is not used for variable length encodings.

It is intended for any character set, not just Unicode. Most representations are not Unicode. And Unicode IS a variable length encoding in some sense of that term. There are escape values in the 8/16/32-bit Unicode encodings that allow additional characters to be defined using multiple 'character' values. So two wchar_t might be needed for a single character.
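For the 16-bit case, the values that play this role in UTF-16 are called surrogates. The following is a minimal sketch of how a code point above U+FFFF is split into two 16-bit units; utf16_encode is an illustrative name, not a standard library function.

```c
#include <stdint.h>

/* Illustrative helper: encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units written to out (1 or 2),
   or 0 if cp is not a valid Unicode scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Reject values beyond U+10FFFF and the surrogate range itself. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: low 10 bits */
    return 2;
}
```

With this sketch, U+20AC (the Euro sign) yields the single unit 0x20AC, while U+1F600 yields the pair 0xD83D 0xDE00. On a platform with a 16-bit wchar_t, the second case is where two wchar_t elements hold one character.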

Jochen Arndt wrote:

While that is true for UTF-16 it is not for UTF-32.

"UTF-32 does not make calculating the displayed width of a string easier, since even with a "fixed width" font there may be more than one code point per character position" UTF-32 - Wikipedia[^] I will state that it is unlikely for this to be used.

Jochen Arndt wrote:

That is wrong.

Presumably you are claiming that UTF-32 contains every possible character. So, based on that logic, what exactly is in UTF-64? Just UTF-32 for the first half and empty space for the rest?

Jochen Arndt wrote:

Still 14 of 32 bits unused.

That isn't relevant. It isn't how the character set is defined but rather the extent and how it is used. There are unused spots in a number of places in Unicode in general. No idea why. Perhaps they figure a specific range of the character set might have a few more characters added in the future. As far as I can recall, many character sets beyond 7 bits end up duplicating or adding to a real character set. For example, the normal extended ASCII set has several dashes and a few mathematical symbols. And that is only using 8 bits.

Jochen Arndt

Quote:

It is intended for any character set, not just Unicode. Most representations are not Unicode.

Examples? (I don't know of one, except when wchar_t is defined as char.)

Quote:

And Unicode IS a variable length encoding in some sense of that term

There are multiple Unicode encodings where some are fixed length and some are variable length.
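To make the difference concrete: UTF-32 always uses one 32-bit unit per code point, while UTF-8 uses between one and four bytes. A minimal sketch of the UTF-8 side follows; the function name utf8_length is an illustrative assumption, not a library call.

```c
#include <stdint.h>

/* Illustrative helper: number of bytes UTF-8 needs for one code point.
   Returns 0 for values that are not valid Unicode scalar values. */
int utf8_length(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not encodable */
    if (cp < 0x80)
        return 1;               /* ASCII range */
    if (cp < 0x800)
        return 2;
    if (cp < 0x10000)
        return 3;
    if (cp <= 0x10FFFF)
        return 4;
    return 0;                   /* beyond the Unicode code space */
}
```

For example, 'A' (U+0041) takes 1 byte, U+00E9 takes 2, the Euro sign U+20AC takes 3, and U+1F600 takes 4.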

Quote:

So two wchar_t might be needed for a single character.

It is intended to be used for single characters. Allowing more than one requires a much more complex implementation (like with the char-based Microsoft multi-byte character sets).

Quote:

"UTF-32 does not make calculating the displayed width of a string easier, since even with a "fixed width" font there may be more than one code point per character position"

Fonts and their display widths are not related to character encoding specifications.

Quote:

That isn't relevant. It isn't how the character set is defined but rather the extent and how it is used.

It is all about the definition regarding the required storage size. The unused code points are there because the codes are grouped (each script or symbol type has an assigned range). See Unicode block - Wikipedia. So new characters / symbols can be added later to the group they belong to (a rather old example is the Euro symbol). The grouping has been chosen because with 32 bits there is enough room. Unicode already contains nearly all known scripts including ancient ones like Runes and Egyptian hieroglyphs and a wide range of symbols.

jschell

Jochen Arndt wrote:

The grouping has been chosen because with 32 bits there is enough room. Unicode already contains nearly all known scripts including ancient ones like Runes and Egyptian hieroglyphs and a wide range of symbols.

As I said, there is a 64-bit definition. I leave it to you to explain what its purpose is.

Jochen Arndt

I can't explain the purpose of something that does not exist. There is neither UTF-64 nor UTF-128.

jschell

Jochen Arndt wrote:

There is neither UTF-64 nor UTF-128.

I stand corrected: as far as I can tell there is no 64-bit encoding. However, there still remain characters in the 32-bit set that require a total of two code points.
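One case where a single displayed character takes two code points even in UTF-32 is a base letter followed by a combining mark: "é" can be stored as the single code point U+00E9 or as U+0065 followed by U+0301. The sketch below counts base characters in a UTF-32 buffer. It treats only the Combining Diacritical Marks block (U+0300 to U+036F) as combining, which is a deliberate simplification; full grapheme segmentation is defined in Unicode Standard Annex #29. Both function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplifying assumption: only U+0300..U+036F count as combining marks. */
static int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Count the code points in a UTF-32 buffer of n elements that are
   not combining marks, as a rough stand-in for "visible characters". */
size_t count_base_characters(const uint32_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (!is_combining_mark(s[i]))
            count++;
    }
    return count;
}
```

Both the one-element buffer { 0x00E9 } and the two-element buffer { 0x0065, 0x0301 } give a count of 1, although the second holds two code points.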