Microsoft Regex Weirdness

14 Posts 6 Posters 15 Views 1 Watching
  • PIEBALDconsult wrote:

    All I see in "Character Classes in .NET Regular Expressions" on Microsoft Learn is \p{name} and \P{name}, e.g. \p{IsCyrillic}. I also just learned about character class subtraction, [base_group-[excluded_group]], which may be a new feature. E.g. [\p{IsBasicLatin}-[\x00-\x7F]] or [\u0000-\uFFFF-[\s\p{P}\p{IsGreek}\x85]]. I note that both of these examples use the \p{name} notation, so maybe what CW is testing isn't doing what she thinks.
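
For engines without character class subtraction, the same set difference can be written with a negative lookahead. A minimal sketch in Python, which is an assumption on my part and not what the thread tested: Python's re module has no [base-[excluded]] syntax, and the consonants helper is a hypothetical name used only for illustration. It mimics the .NET-style class [a-z-[aeiou]].

```python
import re

# .NET-style subtraction such as [a-z-[aeiou]] is not valid in Python's re,
# so the excluded set is expressed as a negative lookahead in front of the base set.
CONSONANT = re.compile(r"(?![aeiou])[a-z]")

def consonants(text: str) -> str:
    """Return the lowercase ASCII letters in text that are not vowels."""
    return "".join(CONSONANT.findall(text))

print(consonants("regex weirdness"))  # prints "rgxwrdnss"
```

The lookahead rejects the excluded characters first, then the base class consumes whatever is left, which is the same set as base minus excluded.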

    Brisingr Aerowing wrote (#5):

    I just tested with .NET 8, and the :Whatever: syntax works perfectly fine.

    What do you get when you cross a joke with a rhetorical question? The metaphorical solid rear-end expulsions have impacted the metaphorical motorized bladed rotating air movement mechanism. Do questions with multiple question marks annoy you???

      PIEBALDconsult wrote (#6):

      OK, I've never seen it and I don't see it documented.

        honey the codewitch wrote (#7):

        They are Unicode character classes. They match the static methods on char in C#. They're supported on all major Unicode regex engines I'm aware of, as is [:characterclass:] on all POSIX engines.

        Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

        • honey the codewitch wrote:

          [[:IsLetter:]_][[:IsLetterOrDigit:]_]*|0|-?[1-9][0-9]*(\\.[0-9]+([Ee]-?[1-9][0-9]*)?)?|[ \t\r\n]+

          runs over twice as slow as

          [[:IsLetter:]_][[:IsLetterOrDigit:]_]*|0|-?[1-9][0-9]*(\\.[0-9]+([Ee]-?[1-9][0-9]*)?)?|[[:IsWhiteSpace:]]+

          IsWhiteSpace is a Unicode charset; [ \t\r\n] is simply ASCII. Why is this weird? Because the IsWhiteSpace character set contains far more than the 4 characters in [ \t\r\n], which should make the transitions slower because several character ranges have to be searched. Also, 1552ms (ASCII version) vs 629ms (IsWhiteSpace version). That was what I found in my tests. My regex lib searches and matches the test input in about 4ms regardless of which expression is used.

          Peter_in_2780 wrote (#8):

          In the days of 7/8-bit chars, those class tests were often implemented as bitmaps (e.g. 8 classes in an array of 256 bytes). A similar trick for, say, UTF-16 wouldn't be outrageous these days.
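
The bitmap trick described above can be sketched roughly like this. It is an illustrative Python sketch, not any real engine's implementation; the flag names, CLASS_TABLE, and in_class are made up for the example, and only three ASCII-only classes are filled in.

```python
# One bit per character class, so up to 8 classes fit in a single byte.
LETTER, DIGIT, SPACE = 0x01, 0x02, 0x04

# One byte per 8-bit character value; built once, after which each
# class test is a single table lookup plus a bit mask.
CLASS_TABLE = bytearray(256)
for code in range(256):
    ch = chr(code)
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        CLASS_TABLE[code] |= LETTER
    if "0" <= ch <= "9":
        CLASS_TABLE[code] |= DIGIT
    if ch in " \t\r\n":
        CLASS_TABLE[code] |= SPACE

def in_class(ch: str, mask: int) -> bool:
    """True if ch is an 8-bit character belonging to any class in mask."""
    code = ord(ch)
    return code < 256 and bool(CLASS_TABLE[code] & mask)
```

For example, in_class("7", LETTER | DIGIT) tests two classes with one lookup. A UTF-16 version of the same idea would need 65536 entries instead of 256.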

          Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012

            PIEBALDconsult wrote (#9):

            honey the codewitch wrote:

            They match the static methods on char in C#.

            Then I hope they don't call those static methods on each character as they go. P.S. Maybe try changing it to the [\p{name}] form and compare?

              honey the codewitch wrote (#10):

              Probably not. I have a 688KB C# file with all of the supported character classes and codepoint ranges. Unicode is big. I imagine they have something similar. As for [\p{name}], I am vague on that form of expression, but isn't it Unicode? The Unicode one is already faster. The curious bit is ASCII. I suppose I could try [:alnum:].
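
A table of codepoint ranges like the one described can be searched with a binary search over the sorted range starts. This is a sketch in Python under stated assumptions: the table holds only the Unicode White_Space code points, typed in by hand for illustration. It is not the 688KB file mentioned above, nor .NET's internal representation, and WHITESPACE_RANGES, RANGE_STARTS, and in_ranges are hypothetical names.

```python
from bisect import bisect_right

# Unicode White_Space code points as inclusive (first, last) ranges, sorted.
# Hand-written for this example; a real table would be generated from Unicode data.
WHITESPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]
RANGE_STARTS = [first for first, _ in WHITESPACE_RANGES]

def in_ranges(ch: str) -> bool:
    """Binary-search the range table for the code point of ch."""
    code = ord(ch)
    # Index of the last range whose start is <= code, or -1 if there is none.
    i = bisect_right(RANGE_STARTS, code) - 1
    return i >= 0 and code <= WHITESPACE_RANGES[i][1]
```

With ten ranges the search takes at most four comparisons, compared with four direct comparisons for [ \t\r\n], so a range table alone would not be expected to make the Unicode class the faster of the two.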

                PIEBALDconsult wrote (#11):

                Right, as far as I can tell [\p{name}] == [:name:]. But do they perform the same? They should.

                  honey the codewitch wrote (#12):

                  I'll find out when I get a chance. Just based on the way I parse this stuff I'm assuming it will be the same.

                    Richard Deeming wrote (#13):

                    Are you using the source generators? That should let you dig into the actual regex code to see the difference. :)

                    "These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer

                      Brisingr Aerowing wrote (#14):

                      OK. That's actually pretty cool.
