Microsoft Regex Weirdness
-
[[:IsLetter:]_][[:IsLetterOrDigit:]_]*|0|-?[1-9][0-9]*(\\.[0-9]+([Ee]-?[1-9][0-9]*)?)?|[ \t\r\n]+
runs over twice as slow as
[[:IsLetter:]_][[:IsLetterOrDigit:]_]*|0|-?[1-9][0-9]*(\\.[0-9]+([Ee]-?[1-9][0-9]*)?)?|[[:IsWhiteSpace:]]+
IsWhiteSpace is a Unicode charset, while [ \t\r\n] is plain ASCII. Why is this weird? Because IsWhiteSpace contains far more than the 4 characters in [ \t\r\n], which should make its transitions slower, since the matcher has to search several character ranges. In my tests it was 1552ms for the ASCII version vs 629ms for the IsWhiteSpace version. My own regex lib searches and matches the test input in about 4ms regardless of which expression is used.
Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix
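For readers unfamiliar with the "search several character ranges" expectation, here is a minimal Python sketch of range-based class membership. It is not how .NET or the poster's library is implemented. The names `make_matcher`, `ASCII_WS` and `UNICODE_WS` are hypothetical, and the Unicode list is the Unicode White_Space property, assumed here to approximate what an IsWhiteSpace class covers.

```python
from bisect import bisect_right

# [ \t\r\n] as sorted, inclusive code point ranges.
ASCII_WS = [(0x09, 0x0A), (0x0D, 0x0D), (0x20, 0x20)]

# Unicode White_Space property (assumed stand-in for IsWhiteSpace).
UNICODE_WS = [
    (0x09, 0x0D), (0x20, 0x20), (0x85, 0x85), (0xA0, 0xA0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029),
    (0x202F, 0x202F), (0x205F, 0x205F), (0x3000, 0x3000),
]

def make_matcher(ranges):
    """Return a membership test that binary-searches the sorted ranges."""
    starts = [lo for lo, _ in ranges]

    def matches(code_point):
        # Find the last range whose start is <= code_point.
        i = bisect_right(starts, code_point) - 1
        return i >= 0 and code_point <= ranges[i][1]

    return matches

is_ascii_ws = make_matcher(ASCII_WS)
is_unicode_ws = make_matcher(UNICODE_WS)
```

With a layout like this, the larger class costs a slightly deeper search per character, which is why the ASCII class being the slower one looks surprising.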
-
honey the codewitch wrote:
twice as slow
Please turn in your keyboard, you're done for the day. But seriously, are those COLONs supposed to be BRACEs? I've never used that syntax, and I'm trying to find it in the documentation.
-
-
I can't speak to MS REs, but in POSIX, character classes use [:space:], which can be used within a bracket (i.e. []) expression.
"A little song, a little dance, a little seltzer down your pants" Chuckles the clown
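As a side illustration of why the syntax question matters: an engine that lacks POSIX bracket classes does not necessarily reject `[[:space:]]`. Python's `re` module is one such engine, and the sketch below shows it reading the pattern as an ordinary set followed by a literal `]`. Whether .NET treats `[:IsWhiteSpace:]` the same way is not established in this thread, so take this only as a demonstration of the general hazard. The `describe` helper is hypothetical.

```python
import re
import warnings

POSIX_STYLE = r"[[:space:]]"

# Python warns about a "possible nested set" here, so silence that locally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    posix_style = re.compile(POSIX_STYLE)

# Python's own whitespace class, for comparison.
native = re.compile(r"\s")

def describe(text):
    """Return (matches POSIX-style pattern, matches \\s) for the whole text."""
    return (
        posix_style.fullmatch(text) is not None,
        native.fullmatch(text) is not None,
    )

# The set is {'[', ':', 's', 'p', 'a', 'c', 'e'} followed by a literal ']'.
print(describe(" "))   # a real space
print(describe("s]"))  # a set member plus ']'
```

The pattern compiles without an error, yet it does not match whitespace at all, so "it compiles" and "it means what I intended" are separate questions.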
All I see in Character Classes in .NET Regular Expressions - .NET | Microsoft Learn is \p{name} and \P{name} -- e.g. \p{IsCyrillic}. And I just learned about -- Character class subtraction: [base_group - [excluded_group]] -- which may be a new feature. E.g. [\p{IsBasicLatin}-[\x00-\x7F]] or [\u0000-\uFFFF-[\s\p{P}\p{IsGreek}\x85]]. I note that both of these require the \p{name} notation, so maybe what CW is testing isn't doing what she thinks.
-
I just tested with .NET 8, and the :Whatever: syntax works perfectly fine.
What do you get when you cross a joke with a rhetorical question? The metaphorical solid rear-end expulsions have impacted the metaphorical motorized bladed rotating air movement mechanism. Do questions with multiple question marks annoy you???
-
OK, I've never seen it and I don't see it documented.
-
They are Unicode character classes. They match the static methods on char in C#. They're supported on all major Unicode regex engines I'm aware of, as is [:characterclass:] on all POSIX engines.
-
In the days of 7/8-bit chars, those class tests were often implemented as bitmaps (e.g. 8 classes in an array of 256 bytes). A similar trick for, say, UTF-16 wouldn't be outrageous these days.
Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012
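The bitmap trick mentioned above can be sketched roughly like this in Python: one byte per character value, one bit per class, so a membership test is a single table lookup and a mask. The flag names, the choice of classes and the `in_class` helper are all hypothetical. This is an illustration of the general idea, not any particular engine's table.

```python
# One bit per character class; a byte holds up to 8 classes.
IS_SPACE = 0x01
IS_DIGIT = 0x02
IS_ALPHA = 0x04

def build_table():
    """Build a 256-entry table of class bit flags for 8-bit characters."""
    table = bytearray(256)
    for ch in b" \t\r\n":
        table[ch] |= IS_SPACE
    for ch in range(ord("0"), ord("9") + 1):
        table[ch] |= IS_DIGIT
    for ch in range(ord("a"), ord("z") + 1):
        table[ch] |= IS_ALPHA
    for ch in range(ord("A"), ord("Z") + 1):
        table[ch] |= IS_ALPHA
    return bytes(table)

CLASS_TABLE = build_table()

def in_class(byte_value, flag):
    """Constant-time membership test: index the table, mask the class bit."""
    return (CLASS_TABLE[byte_value] & flag) != 0
```

Because the lookup cost does not depend on how many characters a class contains, a table like this would make a 4-character class and a much larger one equally cheap for the values it covers.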
-
honey the codewitch wrote:
They match the static methods on char in C#.
Then I hope they don't call those static methods on each character as they go. P.S. Maybe try changing it to the [\p{name}] form and compare?
-
Probably not. I have a 688KB C# file with all of the supported character classes and codepoint ranges. Unicode is big. I imagine they have something similar. As for [\p{name}], I'm vague on that form of expression, but isn't it Unicode? The Unicode one is already faster. The curious bit is ASCII. I suppose I could try [:alnum:]
-
Right, as far as I can tell [\p{name}] == [:name:]. But do they perform the same? They should.
-
I'll find out when I get a chance. Just based on the way I parse this stuff I'm assuming it will be the same.
-
Are you using the source generators? That should let you dig into the actual regex code to see the difference. :)
"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer
-
OK. That's actually pretty cool.