wchar_t in C
-
Hi, is there a data type wchar_t in C? If so, how does it differ from char?
-
Yes, since C90. See wchar_t - C++ Reference[^] and Wide character - Wikipedia[^]. Because it is implementation-defined (compiler and platform dependent), there is no general answer to how it differs from a char.
-
Thanks. I tried finding the size of it and got 4 bytes. If so, it should be able to hold something like "abc". But it's not happening. Why?
Because "abc" is a char* string and not a wchar_t, which represents a single character. wchar_t is used to store fixed-length Unicode characters like UTF-16 or UTF-32, while char can store fixed-length ASCII or ANSI characters (with an associated code page) or variable-length characters like UTF-8 or Microsoft multi-byte characters.
-
It's defined as a wide char for Unicode and UTF-16 support, primarily for file name support (FAT32 LFN, for example) and foreign-language console input. There is also another important type in <wchar.h>, wint_t, which is the generic carrier form. You need the concept of narrowing, which takes a wide character back to its byte approximation (see function wctob): wctob | Microsoft Docs[^]. The reverse concept is widening, which takes a byte character and promotes it (see function btowc): btowc | Microsoft Docs[^]. The letter conversions are controlled by the current LC_CTYPE locale, meaning the language type. Type something like this; it prints the time in Japanese :-)
#include <locale.h>
#include <time.h>
#include <wchar.h>

int main(void) {
    wchar_t str[100];
    time_t t = time(0);
    /* Locale names are platform dependent: "ja-JP" on Windows, typically "ja_JP.UTF-8" elsewhere. */
    setlocale(LC_ALL, "ja-JP");
    wcsftime(str, 100, L"%A %c", localtime(&t));
    wprintf(L"%ls\n", str);  /* %ls prints a wchar_t string */
    return 0;
}
It will look something like "金曜日 2017/12/15 2:09:13"
-
Jochen Arndt wrote:
wchar_t are used to store fixed length Unicode characters like UTF-16 or UTF-32
Just noting that statement is somewhat of a generalization. For starters, Unicode at those bit sizes never represents all characters via a single character. One needs to go to 128 bits for a full representation. Maybe that isn't even big enough. Additionally, it is not limited to Unicode, although perhaps these days that would be the predominant usage in the Western world.
Quote:
"wchar_t are used to store fixed length Unicode characters like UTF-16 or UTF-32" Just noting that statement is somewhat of a generalization.
Why? As far as I know, wchar_t is not used for variable-length encodings.
jschell wrote:
for those bit sizes never represents all characters via single character
While that is true for UTF-16 it is not for UTF-32.
jschell wrote:
One needs to go to 128 bits for a full representation
That is wrong.
The Unicode Blog: Announcing The Unicode® Standard, Version 10.0[^]:
Tuesday, June 20, 2017 Version 10.0 of the Unicode Standard is now available. For the first time, both the core specification and the data files are available on the same date. Version 10.0 adds 8,518 characters, for a total of 136,690 characters. These additions include four new scripts, for a total of 139 scripts, as well as 56 new emoji characters.
Still 14 of 32 bits unused.
-
Quote:
fixed length Unicode characters like UTF-16 or UTF-32
A detailed explanation would be helpful. And are there any examples for them?
wchar_t is used to store "wide characters" (characters using an encoding that requires more than a byte). The most commonly used character encodings for wchar_t are UCS-2 (a subset of UTF-16) and UTF-32. Read Unicode - Wikipedia[^].
-
Thanks.
-
Jochen Arndt wrote:
Why? As far as I know wchar_t is not used for variable length encodings.
It is intended for any character set, not just Unicode. Most representations are not Unicode. And Unicode IS a variable-length encoding by some meaning of that definition. There are escape characters in the 8/16/32-bit Unicode character sets that allow for the definition of additional characters using multiple 'character' values. So two wchar_t might be needed for a single character.
Jochen Arndt wrote:
While that is true for UTF-16 it is not for UTF-32.
"UTF-32 does not make calculating the displayed width of a string easier, since even with a "fixed width" font there may be more than one code point per character position" UTF-32 - Wikipedia[^] I will state that it is unlikely for this to be used.
Jochen Arndt wrote:
That is wrong.
Presumably you are claiming that UTF-32 contains every possible character. So based on that logic what exactly is in UTF-64? Just UTF-32 for the first half and the empty space for the rest?
Jochen Arndt wrote:
Still 14 of 32 bits unused.
That isn't relevant. It isn't how the character set is defined but rather the extent and how it is used. There are unused spots in a number of places in Unicode in general. No idea why. Perhaps they figure a specific range of the character set might have a few more characters added in the future. As far as I can recall, many character sets beyond 7 bits end up duplicating or adding to a real character set. For example, the normal extended ASCII set has several dashes and a few mathematical symbols. And that is only using 8 bits.
-
Quote:
It is intended for any character set, not just unicode. Most representations are not unicode.
Examples (I don't know one except when wchar_t is defined as char)?
Quote:
And unicode IS a variable length encoding to some meaning of that definition
There are multiple Unicode encodings where some are fixed length and some are variable length.
Quote:
So two wchar_t might be needed for a single character.
It is intended to be used for single characters. Allowing more than one requires a much more complex implementation (like with the char-based Microsoft multi-byte character sets).
Quote:
"UTF-32 does not make calculating the displayed width of a string easier, since even with a "fixed width" font there may be more than one code point per character position"
Fonts and their display width are not related to character encoding specifications.
Quote:
That isn't relevant. It isn't how the character set is defined but rather the extent and how it is used.
It is all about the definition regarding the required storage size. The unused code points are there because the codes are grouped (each script or symbol type has an assigned range). See Unicode block - Wikipedia[^]. So new characters / symbols can be added later to the group they belong to (a rather old example is the Euro symbol). The grouping has been chosen because with 32 bits there is enough room. Unicode already contains nearly all known scripts including ancient ones like Runes and Mayan glyphs and a wide range of symbols.
-
Jochen Arndt wrote:
The grouping has been chosen because with 32 bits there is enough room. Unicode already contains nearly all known scripts including ancient ones like Runes and Mayan glyphs and a wide range of symbols.
As I said, there is a 64-bit definition. It is left to you to explain what the purpose of that is.
-
I can't explain the purpose of something that does not exist. There is neither UTF-64 nor UTF-128.