Microsoft Regex Weirdness
-
All I see in "Character Classes in .NET Regular Expressions - .NET | Microsoft Learn" is \p{name} and \P{name} -- e.g. \p{IsCyrillic}. And I just learned about -- Character class subtraction: [base_group - [excluded_group]] -- which may be a new feature. E.g. [\p{IsBasicLatin}-[\x00-\x7F]] or [\u0000-\uFFFF-[\s\p{P}\p{IsGreek}\x85]]. I note that both of these require the \p{name} notation, so maybe what CW is testing isn't doing what she thinks.
-
I just tested with .NET 8, and the :Whatever: syntax works perfectly fine.
What do you get when you cross a joke with a rhetorical question? The metaphorical solid rear-end expulsions have impacted the metaphorical motorized bladed rotating air movement mechanism. Do questions with multiple question marks annoy you???
-
OK, I've never seen it and I don't see it documented.
-
They are Unicode character classes. They match the static methods on char in C#. They're supported on all major Unicode regex engines I'm aware of, as is [:characterclass:] on all POSIX engines.
Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix
-
[[:IsLetter:]_][[:IsLetterOrDigit:]_]*|0|-?[1-9][0-9]*(\\.[0-9]+([Ee]-?[1-9][0-9]*)?)?|[ \t\r\n]+
Runs over twice as slow as
[[:IsLetter:]_][[:IsLetterOrDigit:]_]*|0|-?[1-9][0-9]*(\\.[0-9]+([Ee]-?[1-9][0-9]*)?)?|[[:IsWhiteSpace:]]+
IsWhiteSpace is a Unicode charset. The [ \t\r\n] is simply ASCII. Why is this weird? Because there are a lot more characters in the IsWhiteSpace character set than the 4 in [ \t\r\n], which should make the transitions slower due to having to search several character ranges. Also, 1552ms vs 629ms. That was what I found in my tests. My regex lib searches and matches the test input in about 4ms regardless of which expression is used.
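For anyone following along, here is a minimal sketch of what "searching several character ranges" could look like, written in Python for brevity. It is only an illustration of the idea, not how .NET's engine or the poster's library actually does it. The names (WHITESPACE_RANGES, ASCII_RANGES, make_matcher) are made up for this example, and the first table lists the Unicode White_Space code points.

```python
from bisect import bisect_right

# Unicode White_Space code points as sorted, non-overlapping inclusive ranges.
WHITESPACE_RANGES = [
    (0x0009, 0x000D),
    (0x0020, 0x0020),
    (0x0085, 0x0085),
    (0x00A0, 0x00A0),
    (0x1680, 0x1680),
    (0x2000, 0x200A),
    (0x2028, 0x2029),
    (0x202F, 0x202F),
    (0x205F, 0x205F),
    (0x3000, 0x3000),
]

# The ASCII-only class [ \t\r\n] expressed the same way.
ASCII_RANGES = [(0x09, 0x0A), (0x0D, 0x0D), (0x20, 0x20)]


def make_matcher(ranges):
    """Return a predicate that binary-searches the sorted range list."""
    starts = [lo for lo, _ in ranges]

    def in_class(ch):
        cp = ord(ch)
        # Find the last range whose start is <= cp, then check its upper bound.
        i = bisect_right(starts, cp) - 1
        return i >= 0 and cp <= ranges[i][1]

    return in_class


is_unicode_space = make_matcher(WHITESPACE_RANGES)
is_ascii_space = make_matcher(ASCII_RANGES)
```

With this kind of representation the Unicode class has more ranges to search than the ASCII one (10 vs 3 here), which is why one would naively expect it to be the slower of the two.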
In the days of 7/8-bit chars, those class tests were often implemented as bitmaps (e.g. 8 classes in an array of 256 bytes). A similar trick in, say, UTF-16 wouldn't be outrageous these days.
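A rough sketch of that bitmap trick, in Python for brevity: one byte per character value, one bit per class, so a class test is a single table lookup and a mask. The class names and helpers (build_class_table, has_class) are invented for this example, and it only classifies ASCII; it is not taken from any particular C library or regex engine.

```python
import string

# One bit per class, so 8 classes fit in each byte of the table.
ALPHA, DIGIT, SPACE, UPPER, LOWER, PUNCT, XDIGIT, CNTRL = (1 << i for i in range(8))


def build_class_table():
    """Build a 256-entry table; entry N holds the class bits for byte value N."""
    table = bytearray(256)
    members = {
        ALPHA: string.ascii_letters,
        DIGIT: string.digits,
        SPACE: string.whitespace,
        UPPER: string.ascii_uppercase,
        LOWER: string.ascii_lowercase,
        PUNCT: string.punctuation,
        XDIGIT: string.hexdigits,
        CNTRL: "".join(map(chr, range(32))) + "\x7f",
    }
    for flag, chars in members.items():
        for ch in chars:
            table[ord(ch)] |= flag
    return table


CLASS_TABLE = build_class_table()


def has_class(byte_value, mask):
    """True if the byte belongs to any class in mask: one lookup, one AND."""
    return (CLASS_TABLE[byte_value] & mask) != 0
```

For example, has_class(ord("7"), ALPHA | DIGIT) tests "letter or digit" in one step. The UTF-16 analogue would presumably be the same table with 65536 entries instead of 256.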
Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012
-
honey the codewitch wrote:
They match the static methods on char in C#.
Then I hope they don't call those static methods on each character as they go. P.S. Maybe try changing it to the [\p{name}] form and compare?
-
Probably not. I have a 688KB C# file with all of the supported character classes and codepoint ranges. Unicode is big. I imagine they have something similar. As for [\p{name}], I am vague on that form of expression, but isn't it Unicode? The Unicode one is already faster. The curious bit is ASCII. I suppose I could try [:alnum:].
-
Right, as far as I can tell [\p{name}] == [:name:]. But do they perform the same? They should.
-
I'll find out when I get a chance. Just based on the way I parse this stuff I'm assuming it will be the same.
-
Are you using the source generators? That should let you dig into the actual regex code to see the difference. :)
"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer
-
OK. That's actually pretty cool.