Either I'm missing something or .NET is
-
So .NET uses 16-bit Unicode (UTF-16) by default, and with the territory comes the possibility of 32-bit codes embedded as surrogate pairs. That's all well and good, even if it complicates things. However, .NET provides no functions for operating on those 32-bit values. No char.IsWhiteSpace(), no GetUnicodeCategory(), nada. So what the hell are you supposed to do with these values? *angery*
Real programmers use butterflies
-
I don't know enough about UTF-32, but what about Char.IsHighSurrogate, Char.ConvertToUtf32, and Char.ConvertFromUtf32(Int32), all documented on Microsoft Docs? Or maybe Jon Skeet's article "Unicode and .NET" will have something useful.
"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer
-
Um ... Char.IsWhiteSpace and Char.GetUnicodeCategory (both on Microsoft Docs)? Or am I missing something?
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony AntiTwitter: @DalekDave is now a follower!
-
Those only work for 16-bit Unicode values, not these 32-bit surrogate pairs.
Real programmers use butterflies
-
Okay, yes, but once I have that converted (either to a 32-bit int or to a double-char string), I can't do anything with it. I can't call char.IsWhiteSpace with it. I can't do anything but print its value. Which is stupid.
Real programmers use butterflies
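For what it's worth, here is a minimal sketch of what can be done with a converted value. It assumes .NET Core 3.0 or later, where System.Text.Rune wraps a full code point; the class name CodePointDemo and the sample code points are made up for illustration. Char.GetUnicodeCategory(string, int) also accepts the two-char string directly.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CodePointDemo
{
    // Category of the code point starting at 'index'. A valid surrogate
    // pair at that position is classified as one value, not as two chars.
    public static UnicodeCategory CategoryAt(string s, int index)
        => char.GetUnicodeCategory(s, index);

    // Whitespace test on a full code point via System.Text.Rune
    // (assumes .NET Core 3.0 or later).
    public static bool IsWhiteSpace(int codePoint)
        => Rune.IsWhiteSpace(new Rune(codePoint));

    public static void Demo()
    {
        string clef = char.ConvertFromUtf32(0x1D11E);  // MUSICAL SYMBOL G CLEF
        int codePoint = char.ConvertToUtf32(clef, 0);  // back to 0x1D11E

        Console.WriteLine(clef.Length);                       // 2 chars, one code point
        Console.WriteLine(char.GetUnicodeCategory(clef[0]));  // Surrogate (half a pair)
        Console.WriteLine(CategoryAt(clef, 0));               // OtherSymbol
        Console.WriteLine(Rune.GetUnicodeCategory(new Rune(codePoint)));
        Console.WriteLine(IsWhiteSpace(0x3000));              // True (ideographic space)
    }
}
```

The single-char overloads only ever see half of a pair, which is why they report Surrogate; the (string, int) overloads and Rune are the ones that look at the whole code point.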
-
ASCII FTW! 127 characters should be enough for anyone! :-D I'm not sure there are any whitespace characters that would be encoded as a surrogate pair; see the Wikipedia article "Whitespace character".
"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer
-
"So what the hell are you supposed to do with these values" hope they never appear.
-
And EBCDIC - everybody always forgets EBCDIC :sigh:
-
"So what the hell are you supposed to do with these values" hope they never appear.
Pretty much! :thumbsup:
Real programmers use butterflies
-
I've encountered a weird one, but I don't remember whether it was a surrogate or not. I just remember that it didn't print to the console properly (it cooked it), or didn't save to source as a literal value, or something. It came up as whitespace, but only when I used these huge "not ranges" like [^a-z] (anything but a lowercase letter). The "anything" part created ranges all throughout the 16-bit Unicode spectrum, and that's when I ran into issues with one whitespace character.
Real programmers use butterflies
-
Actually, Baudot is sufficient at 5 bits. CQ de W5ALT
Walt Fair, Jr.PhD P. E. Comport Computing Specializing in Technical Engineering Software
-
Idea: research using Unicode categories in your Regex?
«One day it will have to be officially admitted that what we have christened reality is an even greater illusion than the world of dreams.» Salvador Dali
-
I am using those. The category for a surrogate is Surrogate. Not helpful. Combining a high and a low surrogate gives you a 2-char string, and AFAIK that 2-char string cannot be queried for its Unicode category in .NET.
Real programmers use butterflies
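A hedged sketch of the regex behaviour being described: .NET's Regex matches UTF-16 code units, so each half of a surrogate pair is tested against \p{...} on its own and only ever lands in Cs (Surrogate). The class and method names here are hypothetical, chosen just for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSurrogateDemo
{
    // True if any single UTF-16 code unit in 'text' belongs to the given
    // Unicode general category (for example "Zs", "So" or "Cs").
    public static bool HasCategory(string text, string category)
        => Regex.IsMatch(text, @"\p{" + category + "}");

    public static void Demo()
    {
        // GRINNING FACE: its general category is So, but it is stored as two chars.
        string emoji = char.ConvertFromUtf32(0x1F600);

        Console.WriteLine(HasCategory(emoji, "So"));    // False: the regex never sees the whole code point
        Console.WriteLine(HasCategory(emoji, "Cs"));    // True: both halves are surrogates
        Console.WriteLine(HasCategory("\u2003", "Zs")); // True: EM SPACE lives in the BMP
    }
}
```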
-
honey the codewitch wrote:
The 2 char string cannot be queried for its unicode category in .NET AFAIK
It is a mess, but, check this against what you expect, now:
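// Needs: using System; using System.Globalization; using System.Windows.Forms; (for the Keys enum)
public void PrintUniCodeRange(int sc, int ec)
{
    bool isKey;
    string key = "";

    for (int i = sc; i <= ec; i++)
    {
        // One char for BMP code points, a surrogate pair above U+FFFF.
        // Note: ConvertFromUtf32 throws for the surrogate range U+D800..U+DFFF.
        string ucString = char.ConvertFromUtf32(i);

        isKey = i < 256;
        if (isKey) key = ((Keys)Enum.Parse(typeof(Keys), i.ToString())).ToString();

        // The (string, index) overload classifies a whole surrogate pair starting at index.
        UnicodeCategory cat = Char.GetUnicodeCategory(ucString, 0);

        if (cat != UnicodeCategory.OtherNotAssigned)
        {
            Console.WriteLine($"#{i} | Unicode Category: {cat} {(isKey ? "! Keys Enum: " + key : "")}");
        }
    }
}
Calling the above with sc = 8192 and ec = 8233: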
#8192 | Unicode Category: SpaceSeparator
#8193 | Unicode Category: SpaceSeparator
#8194 | Unicode Category: SpaceSeparator
#8195 | Unicode Category: SpaceSeparator
#8196 | Unicode Category: SpaceSeparator
#8197 | Unicode Category: SpaceSeparator
#8198 | Unicode Category: SpaceSeparator
#8199 | Unicode Category: SpaceSeparator
#8200 | Unicode Category: SpaceSeparator
#8201 | Unicode Category: SpaceSeparator
#8202 | Unicode Category: SpaceSeparator
#8203 | Unicode Category: Format
#8204 | Unicode Category: Format
#8205 | Unicode Category: Format
#8206 | Unicode Category: Format
#8207 | Unicode Category: Format
#8208 | Unicode Category: DashPunctuation
#8209 | Unicode Category: DashPunctuation
#8210 | Unicode Category: DashPunctuation
#8211 | Unicode Category: DashPunctuation
#8212 | Unicode Category: DashPunctuation
#8213 | Unicode Category: DashPunctuation
#8214 | Unicode Category: OtherPunctuation
#8215 | Unicode Category: OtherPunctuation
#8216 | Unicode Category: InitialQuotePunctuation
#8217 | Unicode Category: FinalQuotePunctuation
#8218 | Unicode Category: OpenPunctuation
#8219 | Unicode Category: InitialQuotePunctuation
#8220 | Unicode Category: InitialQuotePunctuation
#8221 | Unicode Category: FinalQuotePunctuation
#8222 | Unicode Category: OpenPunctuation
#8223 | Unicode Category: InitialQuotePunctuation
#8224 | Unicode Category: OtherPunctuation
#8225 | Unicode Category: OtherPunctuation
#8226 | Unicode Category: OtherPunctuation
#8227 | Unicode Category: OtherPunctuation
#8228 | Unicode Category: OtherPunctuation
#8229 | Unicode Category: OtherPunctuation
#8230 | Unicode Category: Othe
-
Hmm, I wonder what my test was doing wrong, because I thought GetUnicodeCategory(string, int) was returning only single-char values for me. Maybe I had a bug.
Real programmers use butterflies
-
Thank you! Turns out there was a bug in my code where I wasn't passing double-char strings in. They ended up as single chars.
Real programmers use butterflies
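The pitfall described above can be sketched roughly as follows. This is only an illustration under assumptions, not the code from the post: the class name CodePointWalker and its Walk method are hypothetical. The point is to hand the original string and index to the (string, int) overloads instead of a lone char.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CodePointWalker
{
    // Yields (code point, category) for each code point in 'text',
    // advancing two chars whenever a valid surrogate pair is found.
    public static IEnumerable<(int CodePoint, UnicodeCategory Category)> Walk(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                // Pass the string and index, not text[i]:
                // a lone char here would just report Surrogate.
                yield return (char.ConvertToUtf32(text, i), char.GetUnicodeCategory(text, i));
                i++; // skip the low surrogate
            }
            else
            {
                // BMP char, or an unpaired surrogate (reported as Surrogate).
                yield return (text[i], char.GetUnicodeCategory(text[i]));
            }
        }
    }
}
```

Walking "a", U+1F600 and a space this way gives three entries rather than four, with the middle one categorised as OtherSymbol.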
-
And EBCDIC - everybody always forgets EBCDIC :sigh:
Will someone please think about the children. https://tenor.com/FJmS.gif