Microsoft Regex Weirdness

honey the codewitch

This expression:

[[:IsLetter:]_][[:IsLetterOrDigit:]_]*|0|-?[1-9][0-9]*(\\.[0-9]+([Ee]-?[1-9][0-9]*)?)?|[ \t\r\n]+

runs over twice as slow as this one:

[[:IsLetter:]_][[:IsLetterOrDigit:]_]*|0|-?[1-9][0-9]*(\\.[0-9]+([Ee]-?[1-9][0-9]*)?)?|[[:IsWhiteSpace:]]+

IsWhiteSpace is a Unicode charset. The [ \t\r\n] is simply ASCII.

Why is this weird? Because there are a lot more characters in the IsWhiteSpace character set than the 4 in [ \t\r\n], which should make the transitions slower due to having to search several character ranges.

In my tests the ASCII version took 1552ms vs 629ms for the IsWhiteSpace version. My regex lib searches and matches the test input in about 4ms regardless of which expression is used.
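For anyone who wants to try a comparison like this themselves, here is a rough timing harness in Python. It is only a sketch and not the original .NET test: Python's re module has no [:name:] or \p{name} classes, so [^\W\d]\w* and \s stand in for the letter and whitespace classes. The sample input and the names tokens and time_pattern are made up for illustration. Any timings it prints say nothing about the .NET engine. The useful part is checking that both patterns produce the same tokens before comparing their speed.

```python
import re
import timeit

# Stand-ins for the thread's patterns (assumption: Python's `re`, not .NET).
IDENT = r"[^\W\d]\w*"  # a letter or underscore, then word characters
NUMBER = r"0|-?[1-9][0-9]*(?:\.[0-9]+(?:[Ee]-?[1-9][0-9]*)?)?"
ASCII_WS = re.compile(IDENT + "|" + NUMBER + r"|[ \t\r\n]+")
UNICODE_WS = re.compile(IDENT + "|" + NUMBER + r"|\s+")

# Hypothetical test input; the original post does not show its input.
SAMPLE = "foo_1 0 -12.5E3 bar\t42\r\n" * 1000


def tokens(pattern, text):
    """Return every match of `pattern` in `text`, in order."""
    return [m.group(0) for m in pattern.finditer(text)]


def time_pattern(pattern, text, repeats=5):
    """Total seconds taken to tokenize `text` `repeats` times."""
    return timeit.timeit(lambda: tokens(pattern, text), number=repeats)


if __name__ == "__main__":
    # Only compare speed once both patterns agree on the tokens.
    assert tokens(ASCII_WS, SAMPLE) == tokens(UNICODE_WS, SAMPLE)
    for name, pattern in (("ascii ws", ASCII_WS), ("unicode ws", UNICODE_WS)):
        print(f"{name}: {time_pattern(pattern, SAMPLE) * 1000:.1f} ms")
```

If the two token lists differ, the patterns are not matching the same thing and the timing comparison is not like for like.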

Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

PIEBALDconsult

honey the codewitch wrote:

twice as slow

Please turn in your keyboard, you're done for the day. But seriously, are those COLONs supposed to be BRACEs? I've never used that syntax, and I'm trying to find it in the documentation.

k5054

I can't speak to MS REs, but in POSIX, character classes use the form [:space:], which can be used within a bracket expression (i.e. []).

"A little song, a little dance, a little seltzer down your pants" Chuckles the clown

PIEBALDconsult

All I see in "Character Classes in .NET Regular Expressions" on Microsoft Learn is \p{name} and \P{name}, e.g. \p{IsCyrillic}.

And I just learned about character class subtraction, [base_group - [excluded_group]], which may be a new feature. E.g. [\p{IsBasicLatin}-[\x00-\x7F]] or [\u0000-\uFFFF-[\s\p{P}\p{IsGreek}\x85]].

I note that both of these require the \p{name} notation, so maybe what CW is testing isn't doing what she thinks.

Brisingr Aerowing

I just tested with .NET 8, and the :Whatever: syntax works perfectly fine.

What do you get when you cross a joke with a rhetorical question? The metaphorical solid rear-end expulsions have impacted the metaphorical motorized bladed rotating air movement mechanism. Do questions with multiple question marks annoy you???

PIEBALDconsult

OK, I've never seen it and I don't see it documented.

honey the codewitch

They are Unicode character classes. They match the static methods on char in C#. They're supported on all major Unicode regex engines I'm aware of, as is [:characterclass:] on all POSIX engines.


Peter_in_2780

In the days of 7/8 bit chars, those class tests were often implemented in bitmaps (e.g. 8 classes in an array of 256 bytes). A similar trick in, say, UTF-16 wouldn't be outrageous these days.
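The bitmap trick can be sketched like this in Python. It is a minimal illustration, not any particular library's table: one byte per character code and one bit per class, so a class test is a single index plus a mask. The flag names, the choice of eight classes and the helper in_class are assumptions made for the example, and only the ASCII half of the table is filled in.

```python
import string

# One bit per class, eight classes, so each table entry fits in one byte.
ALPHA, DIGIT, SPACE, UPPER, LOWER, PUNCT, XDIGIT, CNTRL = (1 << i for i in range(8))

# Illustrative ASCII membership for each class (an assumption, not a standard table).
CLASS_MEMBERS = {
    ALPHA: string.ascii_letters,
    DIGIT: string.digits,
    SPACE: string.whitespace,
    UPPER: string.ascii_uppercase,
    LOWER: string.ascii_lowercase,
    PUNCT: string.punctuation,
    XDIGIT: string.hexdigits,
}


def build_table():
    """Build the 256-byte table; entries 128-255 stay zero in this sketch."""
    table = bytearray(256)
    for flag, members in CLASS_MEMBERS.items():
        for ch in members:
            table[ord(ch)] |= flag
    for code in list(range(32)) + [127]:  # ASCII control codes
        table[code] |= CNTRL
    return table


TABLE = build_table()


def in_class(ch, flags):
    """True if `ch` belongs to any class in `flags`: one lookup and one AND."""
    code = ord(ch)
    return code < 256 and bool(TABLE[code] & flags)
```

For example, in_class("\t", SPACE | CNTRL) is a single table read however many classes are OR-ed into the mask. That is why the cost of such a test does not grow with the size of the class.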

Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012

PIEBALDconsult

honey the codewitch wrote:

They match the static methods on char in C#.

Then I hope they don't call those static methods on each character as they go. P.S. Maybe try changing it to the [\p{name}] form and compare?

honey the codewitch

Probably not. I have a 688KB C# file with all of the supported character classes and codepoint ranges. Unicode is big. I imagine they have something similar.

As for [\p{name}], I am vague on that form of expression, but isn't it Unicode? The Unicode one is already faster. The curious bit is ASCII. I suppose I could try [:alnum:].
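To make the "codepoint ranges" idea concrete, here is a small Python sketch of a range-table lookup. It is an illustration only, not the contents of the C# file mentioned above and not how .NET does it. The ranges are a hand-transcribed list for the Unicode White_Space property and should be checked against the Unicode Character Database before being relied on. The names WHITE_SPACE_RANGES and in_ranges are made up for the example.

```python
from bisect import bisect_right

# Sorted, non-overlapping, inclusive (low, high) codepoint ranges.
# Hand-transcribed for illustration; verify against the Unicode data files.
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]

# Range starts only, so bisect can find the candidate range.
STARTS = [low for low, _ in WHITE_SPACE_RANGES]


def in_ranges(ch):
    """Binary-search the table for the range that could contain `ch`."""
    code = ord(ch)
    i = bisect_right(STARTS, code) - 1  # last range starting at or before code
    return i >= 0 and code <= WHITE_SPACE_RANGES[i][1]
```

With a table like this, a Unicode class test costs a few comparisons per character. A four-character ASCII set can be tested directly, which is why the timing in the first post looks backwards.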


PIEBALDconsult

Right, as far as I can tell [\p{name}] == [:name:]. But do they perform the same? They should.

honey the codewitch

I'll find out when I get a chance. Just based on the way I parse this stuff I'm assuming it will be the same.


Richard Deeming

Are you using the source generators? That should let you dig into the actual regex code to see the difference. :)

"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer

Brisingr Aerowing

OK. That's actually pretty cool.
