wchar_t in C

Anonygeeker

Hi, is there a data type wchar_t in C? If so, how does it differ from char?

CPallini

Quote:

Is there a data type wchar_t in C?

Quote:

If so, how it differ from char?

It is compiler dependent.

Jochen Arndt

Yes, since C90. See wchar_t - C++ Reference and Wide character - Wikipedia. Because it is implementation defined (compiler and platform dependent), there is no general answer to how it differs from a char.

Anonygeeker

Thanks. I checked its size and got 4 bytes. If so, it should be able to hold something like "abc", but that does not work. Why?

Jochen Arndt

Because "abc" is a char* string and not a wchar_t, which represents a single character. wchar_t are used to store fixed length Unicode characters like UTF-16 or UTF-32, while char can store fixed length ASCII or ANSI characters (with an associated code page) or variable length characters like UTF-8 or Microsoft multibyte characters.

leon de boer

It's defined as a wide char for Unicode and UTF-16 support, primarily for filename support (FAT32 LFN, for example) and foreign console input. There is also another important type in <wchar.h>, wint_t, which is the generic carrier form.

You need the concept of narrowing, which takes a wide character back to its byte approximation (see function wctob: wctob | Microsoft Docs). The reverse concept is widening, which takes a byte character and promotes it (see function btowc: btowc | Microsoft Docs).

The letter conversions are controlled by the LC_CTYPE category of the current locale, meaning the language type. Type something like this; it prints the time in Japanese :-)

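#include <locale.h>
#include <time.h>
#include <wchar.h>

int main(void){
    wchar_t str[100];
    time_t t = time(0);
    /* "ja-JP" is a Windows locale name; other systems may need e.g. "ja_JP.UTF-8" */
    setlocale(LC_ALL, "ja-JP");
    /* format weekday name plus the locale's date and time into the wide buffer */
    wcsftime(str, 100, L"%A %c", localtime(&t));
    wprintf(L"%ls\n", str);  /* %ls (lowercase l) prints a wide string */
    return 0;
}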

It will look something like "金曜日 2017/12/15 2:09:13"

jschell

Jochen Arndt wrote:

wchar_t are used to store fixed length Unicode characters like UTF-16 or UTF-32

Just noting that statement is somewhat of a generalization. For starters, unicode, for those bit sizes never represents all characters via single character. One needs to go to 128 bits for a full representation. Maybe that isn't even big enough. Additionally it is not limited to unicode. Although perhaps these days that would be the predominant usage in the western world.

Jochen Arndt

Quote:

"wchar_t are used to store fixed length Unicode characters like UTF-16 or UTF-32" Just noting that statement is somewhat of a generalization.

Why? As far as I know wchar_t is not used for variable length encodings.

jschell wrote:

for those bit sizes never represents all characters via single character

While that is true for UTF-16 it is not for UTF-32.

jschell wrote:

One needs to go to 128 bits for a full representation

That is wrong.

The Unicode Blog: Announcing The Unicode® Standard, Version 10.0:

Tuesday, June 20, 2017 Version 10.0 of the Unicode Standard is now available. For the first time, both the core specification and the data files are available on the same date. Version 10.0 adds 8,518 characters, for a total of 136,690 characters. These additions include four new scripts, for a total of 139 scripts, as well as 56 new emoji characters.

Still 14 of 32 bits unused.

Anonygeeker

"fixed length Unicode characters like UTF-16 or UTF-32": a detailed explanation would be helpful. Are there any examples of them?

Jochen Arndt

wchar_t are used to store "wide characters" (characters using an encoding that requires more than a byte). The most commonly used character encodings for wchar_t are UCS-2 (a subset of UTF-16) and UTF-32. Read Unicode - Wikipedia.

Anonygeeker

Thanks..

jschell

Jochen Arndt wrote:

Why? As far as I know wchar_t is not used for variable length encodings.

It is intended for any character set, not just unicode. Most representations are not unicode. And unicode IS a variable length encoding to some meaning of that definition. There are escape characters in the 8/16/32 bit unicode character sets that allow for the definition of additional characters using multiple 'character' values. So two wchar_t might be needed for a single character.

Jochen Arndt wrote:

While that is true for UTF-16 it is not for UTF-32.

"UTF-32 does not make calculating the displayed width of a string easier, since even with a "fixed width" font there may be more than one code point per character position" UTF-32 - Wikipedia[^] I will state that it is unlikely for this to be used.

Jochen Arndt wrote:

That is wrong.

Presumably you are claiming that UTF-32 contains every possible character. So based on that logic what exactly is in UTF-64? Just UTF-32 for the first half and the empty space for the rest?

Jochen Arndt wrote:

Still 14 of 32 bits unused.

That isn't relevant. It isn't how the character set is defined but rather the extent and how it is used. There are unused spots in a number of places in unicode in general. No idea why. Perhaps they figure a specific range of the character set might have a few more characters added in the future. Far as I can recall many character sets beyond 7 bits end up duplicating or adding to a real character set. For example the normal extended ascii set has several dashes and a few mathematical symbols. And that is only using 8 bits.

Jochen Arndt

Quote:

It is intended for any character set, not just unicode. Most representations are not unicode.

Examples (I don't know one except when wchar_t is defined as char)?

Quote:

And unicode IS a variable length encoding to some meaning of that definition

There are multiple Unicode encodings where some are fixed length and some are variable length.

Quote:

So two wchar_t might be needed for a single character.

It is intended to be used for single characters. Allowing more than one requires a much more complex implementation (like with the char based Microsoft multi byte character sets).

Quote:

"UTF-32 does not make calculating the displayed width of a string easier, since even with a "fixed width" font there may be more than one code point per character position"

Fonts and their display widths are not related to character encoding specifications.

Quote:

That isn't relevant. It isn't how the character set is defined but rather the extent and how it is used.

It is all about the definition regarding the required storage size. The unused code points are there because the codes are grouped (each script or symbol type has an assigned range). See Unicode block - Wikipedia. So new characters / symbols can be added later to the group they belong to (a rather old example is the Euro symbol). The grouping has been chosen because with 32 bits there is enough room. Unicode already contains nearly all known scripts including ancient ones like Runes and Mayan glyphs and a wide range of symbols.

jschell

Jochen Arndt wrote:

The grouping has been chosen because with 32 bits there is enough room. Unicode already contains nearly all known scripts including ancient ones like Runes and Mayan glyphs and a wide range of symbols.

As I said there is a 64 bit definition. Left to you to explain what the purpose of that is.

Jochen Arndt

I can't explain the purpose of something that does not exist. There is neither UTF-64 nor UTF-128.

jschell

Jochen Arndt wrote:

There is neither UTF-64 nor UTF-128.

I stand corrected - as far as I can tell there is no 64 bit encoding. However, there still remain characters in the 32 bit set that require a total of two code points.