Microsoft Regex Weirdness

    honey the codewitch wrote (#1):

    This expression:

        [[:IsLetter:]_][[:IsLetterOrDigit:]_]*|0|-?[1-9][0-9]*(\\.[0-9]+([Ee]-?[1-9][0-9]*)?)?|[ \t\r\n]+

    runs more than twice as slow as this one:

        [[:IsLetter:]_][[:IsLetterOrDigit:]_]*|0|-?[1-9][0-9]*(\\.[0-9]+([Ee]-?[1-9][0-9]*)?)?|[[:IsWhiteSpace:]]+

    IsWhiteSpace is a Unicode charset. The [ \t\r\n] is simply ASCII. Why is this weird? Because the IsWhiteSpace character set contains a lot more characters than the 4 in [ \t\r\n], which should make its transitions slower, since several character ranges have to be searched. Instead, my tests came out at 1552ms for the ASCII version vs 629ms for the IsWhiteSpace version. My regex lib searches and matches the test input in about 4ms regardless of which expression is used.
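
The expectation in this post can be made concrete with a small sketch. It is written in Python rather than C#, and it assumes a class test done by binary search over sorted code point ranges, which is one common approach and not a claim about how the .NET engine works. The range list is the Unicode White_Space property; the names make_matcher, is_ascii_ws and is_unicode_ws are hypothetical.

```python
from bisect import bisect_right

# Unicode White_Space code points as sorted, inclusive (first, last) ranges.
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]

# The ASCII class [ \t\r\n] expressed the same way.
ASCII_WS_RANGES = [(0x09, 0x0A), (0x0D, 0x0D), (0x20, 0x20)]


def make_matcher(ranges):
    """Return a predicate that tests one character against sorted ranges."""
    starts = [first for first, _ in ranges]

    def matches(ch):
        cp = ord(ch)
        # Index of the last range that starts at or before cp.
        i = bisect_right(starts, cp) - 1
        return i >= 0 and cp <= ranges[i][1]

    return matches


is_unicode_ws = make_matcher(WHITE_SPACE_RANGES)
is_ascii_ws = make_matcher(ASCII_WS_RANGES)
```

Under this model the Unicode class has ten ranges to search against three for the ASCII class, so it should cost the same or slightly more per character, never less. That is why the measured result looks backwards.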

    Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

      PIEBALDconsult wrote (#2):

      honey the codewitch wrote:

      twice as slow

      Please turn in your keyboard, you're done for the day. But seriously, are those COLONs supposed to be BRACEs? I've never used that syntax, and I'm trying to find it in the documentation.

        k5054 wrote (#3):

        I can't speak to MS REs, but in POSIX, character classes are written like [:space:] and are used inside a bracket expression, e.g. [[:space:]].

        "A little song, a little dance, a little seltzer down your pants" Chuckles the clown

          PIEBALDconsult wrote (#4):

          All I see in "Character Classes in .NET Regular Expressions" on Microsoft Learn is \p{name} and \P{name}, e.g. \p{IsCyrillic}. I also just learned about character class subtraction, [base_group-[excluded_group]], which may be a new feature. E.g. [\p{IsBasicLatin}-[\x00-\x7F]] or [\u0000-\uFFFF-[\s\p{P}\p{IsGreek}\x85]]. I note that both of these use the \p{name} notation, so maybe what CW is testing isn't doing what she thinks.
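
For readers who have not met class subtraction before, here is a rough sketch of what it means. Python's re module has no subtraction syntax, so this swaps in a negative lookahead to get the same effect; the consonant example and the names CONSONANT and consonants are illustrative and do not come from the thread.

```python
import re

# .NET would write this class as [a-z-[aeiou]]. Here the lookahead rejects
# the excluded set before the base class consumes a character.
CONSONANT = re.compile(r"(?![aeiou])[a-z]")


def consonants(text):
    """Return every lowercase ASCII consonant in text, in order."""
    return CONSONANT.findall(text)


print(consonants("regex"))  # ['r', 'g', 'x']
```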

            Brisingr Aerowing wrote (#5):

            I just tested with .NET 8, and the :Whatever: syntax works perfectly fine.

            What do you get when you cross a joke with a rhetorical question? The metaphorical solid rear-end expulsions have impacted the metaphorical motorized bladed rotating air movement mechanism. Do questions with multiple question marks annoy you???

              PIEBALDconsult wrote (#6):

              OK, I've never seen it and I don't see it documented.

                honey the codewitch wrote (#7):

                They are Unicode character classes. They match the static methods on char in C#. They're supported on all major Unicode regex engines I'm aware of, as is [:characterclass:] on all POSIX engines.

                  Peter_in_2780 wrote (#8):

                  In the days of 7/8 bit chars, those class tests were often implemented as bitmaps (e.g. 8 classes in an array of 256 bytes). A similar trick in, say, UTF-16 wouldn't be outrageous these days.
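
The bitmap trick described here can be sketched as follows, in Python for brevity. The three classes, the flag names and the helper has_class are assumptions made for illustration; only the ASCII half of the table is filled, and nothing here is taken from a real engine.

```python
# One bit per class, so up to eight classes fit in one byte per character.
LETTER = 0x01
DIGIT = 0x02
SPACE = 0x04


def build_table():
    """Build a 256-entry table of class bits (only ASCII entries are set)."""
    table = bytearray(256)
    for code in range(128):
        ch = chr(code)
        if ch.isalpha():
            table[code] |= LETTER
        if ch.isdigit():
            table[code] |= DIGIT
        if ch in " \t\r\n":
            table[code] |= SPACE
    return table


CLASS_TABLE = build_table()


def has_class(ch, mask):
    """True if ch is in any class named by mask: one index plus one AND."""
    code = ord(ch)
    return code < 256 and (CLASS_TABLE[code] & mask) != 0
```

A test such as has_class(ch, LETTER | DIGIT) costs the same as a test for a single class, which is the appeal of the layout. A table that large for every UTF-16 code unit would need 65,536 entries instead of 256.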

                  Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012

                    PIEBALDconsult wrote (#9):

                    honey the codewitch wrote:

                    They match the static methods on char in C#.

                    Then I hope they don't call those static methods on each character as they go. P.S. Maybe try changing it to the [\p{name}] form and compare?
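
A like-for-like comparison of two spellings of the same pattern might be set up along these lines. This is a sketch in Python's re module, not the .NET test from the thread: the ASCII classes stand in for [:IsLetter:] and [:IsLetterOrDigit:], the \\. is assumed to be a string-escaped \., and the sample text and helper names are made up. It shows only the shape of the harness and says nothing about what .NET will measure.

```python
import re
import timeit

# Shared prefix: identifiers, zero, and signed numbers with optional exponent.
IDENT_NUM = (
    r"[A-Za-z_][A-Za-z0-9_]*|0|-?[1-9][0-9]*(\.[0-9]+([Ee]-?[1-9][0-9]*)?)?"
)
ASCII_WS = re.compile(IDENT_NUM + r"|[ \t\r\n]+")
ANY_WS = re.compile(IDENT_NUM + r"|\s+")

SAMPLE = "foo 123 bar_baz\t-4.5e10\r\n0 x1 " * 1000


def count_matches(pattern, text):
    """Scan the whole text and count the tokens found."""
    return sum(1 for _ in pattern.finditer(text))


def time_pattern(pattern, text, repeat=3, number=5):
    """Best-of-repeat average seconds for one full scan of text."""
    runs = timeit.repeat(
        lambda: count_matches(pattern, text), repeat=repeat, number=number
    )
    return min(runs) / number


if __name__ == "__main__":
    # Check that both spellings tokenize identically before timing them.
    assert count_matches(ASCII_WS, SAMPLE) == count_matches(ANY_WS, SAMPLE)
    for name, pattern in (("[ \\t\\r\\n]+", ASCII_WS), ("\\s+", ANY_WS)):
        print(f"{name}: {time_pattern(pattern, SAMPLE) * 1000:.2f} ms per scan")
```

Checking that both patterns produce the same matches before timing them guards against the case raised earlier in the thread, where one spelling silently means something different.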

                      honey the codewitch wrote (#10):

                      Probably not. I have a 688KB C# file with all of the supported character classes and codepoint ranges. Unicode is big. I imagine they have something similar. As for [\p{name}], I am vague on that form of expression, but isn't it Unicode? The Unicode one is already faster. The curious bit is ASCII. I suppose I could try [:alnum:].

                        PIEBALDconsult wrote (#11):

                        Right, as far as I can tell [\p{name}] == [:name:]. But do they perform the same? They should.

                          honey the codewitch wrote (#12):

                          I'll find out when I get a chance. Just based on the way I parse this stuff I'm assuming it will be the same.

                            Richard Deeming wrote (#13):

                            Are you using the source generators? That should let you dig into the actual regex code to see the difference. :)

                            "These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer

                              Brisingr Aerowing wrote (#14):

                              OK. That's actually pretty cool.
