Skip to content
  • Categories
  • Recent
  • Tags
  • Popular
  • World
  • Users
  • Groups
Skins
  • Light
  • Cerulean
  • Cosmo
  • Flatly
  • Journal
  • Litera
  • Lumen
  • Lux
  • Materia
  • Minty
  • Morph
  • Pulse
  • Sandstone
  • Simplex
  • Sketchy
  • Spacelab
  • United
  • Yeti
  • Zephyr
  • Dark
  • Cyborg
  • Darkly
  • Quartz
  • Slate
  • Solar
  • Superhero
  • Vapor

  • Default (No Skin)
  • No Skin
Collapse
Code Project
  1. Home
  2. The Lounge
  3. I did a thing. It's neat.

I did a thing. It's neat.

Scheduled Pinned Locked Moved The Lounge
csharpdesignregexdotnetcom
15 Posts 7 Posters 1 Views 1 Watching
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • P PIEBALDconsult

    honey the codewitch wrote:

    works like .NET's does

    And how is that? I use the Compiled option when I make a RegEx I know I'll use a bunch of times, but I don't know what that actually does.

    H Offline
    H Offline
    honey the codewitch
    wrote on last edited by
    #4

    Normally Regular Expressions (using Microsoft's or my engine) are interpreted at runtime. That is, the engine turns the expression into a state machine graph and traverses that at runtime to match text. When you compile it, it is no longer interpreted at runtime. Instead, the state traversal for that expression is hardcoded into machine language instructions, and optimized in the process. The upshot is longer startup times but faster execution. However, my engine allows you to operate on state machines that are packed into a simple flat integer array. (int[]), and runs nearly as fast as the compiled version, or faster in some cases. Pick your poison.

    Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

    P 1 Reply Last reply
    0
    • H honey the codewitch

      I needed a project to ward of boredom in the wee hours. I decided to do something I can't do - Reflection Emit from C# I built a Compile() feature into my regular expression engine that works like .NET's does. After wrestling with it for a little over a day, pay dirt! Gosh, IL is weird, but pretty simple once you get the hang of it. The VM is a very basic stack machine. It's extremely minimalist compared to modern CISC CPUs. Anyway, now I need to figure out how to write an article around it once I'm done basking in my accomplishment. :laugh: Edit: I wrote the article: Compiling DFA Regular Expressions into .NET Assemblies[^] I'm pretty proud of this code. Edit 2: Updated the timing to remove the time for console writing and I'm blown away.

      Microsoft Regex "Lexer": [■■■■■■■■■■] 100% Done in 1556ms
      Microsoft Regex Compiled "Lexer": [■■■■■■■■■■] 100% Done in 1186ms
      Expanded NFA Lexer: [■■■■■■■■■■] 100% Done in 109ms
      Compacted NFA Lexer: [■■■■■■■■■■] 100% Done in 100ms
      Unoptimized DFA Lexer: [■■■■■■■■■■] 100% Done in 111ms
      Optimized DFA Lexer: [■■■■■■■■■■] 100% Done in 6ms
      Table based DFA Lexer: [■■■■■■■■■■] 100% Done in 4ms
      Compiled DFA Lexer: [■■■■■■■■■■] 100% Done in 5ms

      Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

      S Offline
      S Offline
      Southmountain
      wrote on last edited by
      #5

      :thumbsup::thumbsup::thumbsup:

      diligent hands rule....

      1 Reply Last reply
      0
      • H honey the codewitch

        Normally Regular Expressions (using Microsoft's or my engine) are interpreted at runtime. That is, the engine turns the expression into a state machine graph and traverses that at runtime to match text. When you compile it, it is no longer interpreted at runtime. Instead, the state traversal for that expression is hardcoded into machine language instructions, and optimized in the process. The upshot is longer startup times but faster execution. However, my engine allows you to operate on state machines that are packed into a simple flat integer array. (int[]), and runs nearly as fast as the compiled version, or faster in some cases. Pick your poison.

        Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

        P Offline
        P Offline
        PIEBALDconsult
        wrote on last edited by
        #6

        My main thing is that I don't know what the Compiled Option actually does. I have an idea what it might or should do, but then it doesn't seem to make much sense not to include it every time. The "Compiled" option can't instruct the compiler to make IL of the expression at compile time, it has to be at runtime. I can definitely understand that "compiling" can help when a RegEx is used many times, but does it slow down a RegEx which is used only once or a few times? Is there some number of calls where performance matches (for a particular expression)? I also found that (apparently) the RegEx class caches the expressions and I suspect that it must "compile" them as well, otherwise why bother? I just read this: https://stackoverflow.com/questions/513412/how-does-regexoptions-compiled-work[^] Which I don't consider canon, though it may be correct. So Compiled makes it IL at some cost -- great, perfect. But non-Compiled must do something (at lower cost) as well which can be cached -- "turns the expression into a state machine graph" or similar? Might it then be kinda: 0) Take Expression in as text 1) Create state machine graph of the Expression and cache it 2) If requested, compile the state machine graph into IL -- and maybe cache that as well? There is also the question of performance between the static RegEx methods and an instance of a RegEx. Using an instance should at least eliminate the cost of a cache look-up. Part of why I ask is that I have written a number of methods which use Regular Expressions, and some of these get used by CLR Functions in SQL Server. I therefore want them to run as efficiently as they can as they may be called several million times during a query. Also, in your response, I suspect that when you say "runtime" you may mean "on each call", not just "on the first call".

        H J 3 Replies Last reply
        0
        • P PIEBALDconsult

          My main thing is that I don't know what the Compiled Option actually does. I have an idea what it might or should do, but then it doesn't seem to make much sense not to include it every time. The "Compiled" option can't instruct the compiler to make IL of the expression at compile time, it has to be at runtime. I can definitely understand that "compiling" can help when a RegEx is used many times, but does it slow down a RegEx which is used only once or a few times? Is there some number of calls where performance matches (for a particular expression)? I also found that (apparently) the RegEx class caches the expressions and I suspect that it must "compile" them as well, otherwise why bother? I just read this: https://stackoverflow.com/questions/513412/how-does-regexoptions-compiled-work[^] Which I don't consider canon, though it may be correct. So Compiled makes it IL at some cost -- great, perfect. But non-Compiled must do something (at lower cost) as well which can be cached -- "turns the expression into a state machine graph" or similar? Might it then be kinda: 0) Take Expression in as text 1) Create state machine graph of the Expression and cache it 2) If requested, compile the state machine graph into IL -- and maybe cache that as well? There is also the question of performance between the static RegEx methods and an instance of a RegEx. Using an instance should at least eliminate the cost of a cache look-up. Part of why I ask is that I have written a number of methods which use Regular Expressions, and some of these get used by CLR Functions in SQL Server. I therefore want them to run as efficiently as they can as they may be called several million times during a query. Also, in your response, I suspect that when you say "runtime" you may mean "on each call", not just "on the first call".

          H Offline
          H Offline
          honey the codewitch
          wrote on last edited by
          #7

          PIEBALDconsult wrote:

          I also found that (apparently) the RegEx class caches the expressions and I suspect that it must "compile" them as well, otherwise why bother?

          I haven't read anything about the caching, but what I can tell you is that algorithmically it is very little work to convert a regular expression into an NFA state machine. It's hardly worth caching, unless they actually mean caching the state machine, but I think that's already done when you call Parse() and get an instance back. It doesn't actually reconstitute it from the string each time. Converting to a DFA *does* take time, but Microsoft's engine doesn't use DFA. DFA is much faster in the general case than NFA, but doesn't backtrack. Therefore my library beats out Microsoft's particularly when the input is shorter than say, several pages of text by about 3x under ideal cases - MS's has things like Boyer-Moore optimization which can scan long strings faster. If you really want the fastest possible performance and you know what the expressions are ahead of time - like they are relatively static/don't change that often, then make them compiled. Otherwise, leave them uncompiled. If you'd like something really fast (and you know the expressions ahead of time) I can hook you up with my little engine. It's pretty small. It doesn't support anchors or backtracking, but does full UTF-32 Unicode. (You'll need to use the ToUtf32 function to convert UTF-16 to UTF-32.

          Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

          J 1 Reply Last reply
          0
          • H honey the codewitch

            I needed a project to ward of boredom in the wee hours. I decided to do something I can't do - Reflection Emit from C# I built a Compile() feature into my regular expression engine that works like .NET's does. After wrestling with it for a little over a day, pay dirt! Gosh, IL is weird, but pretty simple once you get the hang of it. The VM is a very basic stack machine. It's extremely minimalist compared to modern CISC CPUs. Anyway, now I need to figure out how to write an article around it once I'm done basking in my accomplishment. :laugh: Edit: I wrote the article: Compiling DFA Regular Expressions into .NET Assemblies[^] I'm pretty proud of this code. Edit 2: Updated the timing to remove the time for console writing and I'm blown away.

            Microsoft Regex "Lexer": [■■■■■■■■■■] 100% Done in 1556ms
            Microsoft Regex Compiled "Lexer": [■■■■■■■■■■] 100% Done in 1186ms
            Expanded NFA Lexer: [■■■■■■■■■■] 100% Done in 109ms
            Compacted NFA Lexer: [■■■■■■■■■■] 100% Done in 100ms
            Unoptimized DFA Lexer: [■■■■■■■■■■] 100% Done in 111ms
            Optimized DFA Lexer: [■■■■■■■■■■] 100% Done in 6ms
            Table based DFA Lexer: [■■■■■■■■■■] 100% Done in 4ms
            Compiled DFA Lexer: [■■■■■■■■■■] 100% Done in 5ms

            Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

            K Offline
            K Offline
            Kenneth Haugland
            wrote on last edited by
            #8

            You can force C# to work like C++ with this code :laugh: I also remembered that a workaround for a bug in C# involved some Emit coding, which makes it pretty useful to know.

            H 1 Reply Last reply
            0
            • K Kenneth Haugland

              You can force C# to work like C++ with this code :laugh: I also remembered that a workaround for a bug in C# involved some Emit coding, which makes it pretty useful to know.

              H Offline
              H Offline
              honey the codewitch
              wrote on last edited by
              #9

              I was diving through the native asm code for this trying to track down a weird bug. I was calling a function off the wrong object (the object didn't have that function) and it created a stack overflow. Whoops. But I did notice the emitted native code was pretty similar to the IL.

              Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

              1 Reply Last reply
              0
              • P PIEBALDconsult

                My main thing is that I don't know what the Compiled Option actually does. I have an idea what it might or should do, but then it doesn't seem to make much sense not to include it every time. The "Compiled" option can't instruct the compiler to make IL of the expression at compile time, it has to be at runtime. I can definitely understand that "compiling" can help when a RegEx is used many times, but does it slow down a RegEx which is used only once or a few times? Is there some number of calls where performance matches (for a particular expression)? I also found that (apparently) the RegEx class caches the expressions and I suspect that it must "compile" them as well, otherwise why bother? I just read this: https://stackoverflow.com/questions/513412/how-does-regexoptions-compiled-work[^] Which I don't consider canon, though it may be correct. So Compiled makes it IL at some cost -- great, perfect. But non-Compiled must do something (at lower cost) as well which can be cached -- "turns the expression into a state machine graph" or similar? Might it then be kinda: 0) Take Expression in as text 1) Create state machine graph of the Expression and cache it 2) If requested, compile the state machine graph into IL -- and maybe cache that as well? There is also the question of performance between the static RegEx methods and an instance of a RegEx. Using an instance should at least eliminate the cost of a cache look-up. Part of why I ask is that I have written a number of methods which use Regular Expressions, and some of these get used by CLR Functions in SQL Server. I therefore want them to run as efficiently as they can as they may be called several million times during a query. Also, in your response, I suspect that when you say "runtime" you may mean "on each call", not just "on the first call".

                H Offline
                H Offline
                honey the codewitch
                wrote on last edited by
                #10

                Regarding my earlier response about suggesting my engine. I re-ran it to compare MS uncompiled and compiled with my engine. Edit: Changed my timing to remove the time for Console.Write/Console.WriteLine

                Microsoft Regex "Lexer": [■■■■■■■■■■] 100% Done in 1556ms
                Microsoft Regex Compiled "Lexer": [■■■■■■■■■■] 100% Done in 1186ms
                Expanded NFA Lexer: [■■■■■■■■■■] 100% Done in 109ms
                Compacted NFA Lexer: [■■■■■■■■■■] 100% Done in 100ms
                Unoptimized DFA Lexer: [■■■■■■■■■■] 100% Done in 111ms
                Optimized DFA Lexer: [■■■■■■■■■■] 100% Done in 6ms
                Table based DFA Lexer: [■■■■■■■■■■] 100% Done in 4ms
                Compiled DFA Lexer: [■■■■■■■■■■] 100% Done in 5ms

                Take that how you will.

                Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

                P 1 Reply Last reply
                0
                • H honey the codewitch

                  Regarding my earlier response about suggesting my engine. I re-ran it to compare MS uncompiled and compiled with my engine. Edit: Changed my timing to remove the time for Console.Write/Console.WriteLine

                  Microsoft Regex "Lexer": [■■■■■■■■■■] 100% Done in 1556ms
                  Microsoft Regex Compiled "Lexer": [■■■■■■■■■■] 100% Done in 1186ms
                  Expanded NFA Lexer: [■■■■■■■■■■] 100% Done in 109ms
                  Compacted NFA Lexer: [■■■■■■■■■■] 100% Done in 100ms
                  Unoptimized DFA Lexer: [■■■■■■■■■■] 100% Done in 111ms
                  Optimized DFA Lexer: [■■■■■■■■■■] 100% Done in 6ms
                  Table based DFA Lexer: [■■■■■■■■■■] 100% Done in 4ms
                  Compiled DFA Lexer: [■■■■■■■■■■] 100% Done in 5ms

                  Take that how you will.

                  Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

                  P Offline
                  P Offline
                  PIEBALDconsult
                  wrote on last edited by
                  #11

                  I don't doubt your code and abilities. But I can't just use anything I want, my employer allows only packages they certify for use. Additionally, I'm unsure yours could be used from within a CLR Function. Aside from that, you say that there are a few limitations of yours, and they might be critical to me -- you mentioned anchors, and some of my more complex expressions use the \G anchor.

                  1 Reply Last reply
                  0
                  • H honey the codewitch

                    PIEBALDconsult wrote:

                    I also found that (apparently) the RegEx class caches the expressions and I suspect that it must "compile" them as well, otherwise why bother?

                    I haven't read anything about the caching, but what I can tell you is that algorithmically it is very little work to convert a regular expression into an NFA state machine. It's hardly worth caching, unless they actually mean caching the state machine, but I think that's already done when you call Parse() and get an instance back. It doesn't actually reconstitute it from the string each time. Converting to a DFA *does* take time, but Microsoft's engine doesn't use DFA. DFA is much faster in the general case than NFA, but doesn't backtrack. Therefore my library beats out Microsoft's particularly when the input is shorter than say, several pages of text by about 3x under ideal cases - MS's has things like Boyer-Moore optimization which can scan long strings faster. If you really want the fastest possible performance and you know what the expressions are ahead of time - like they are relatively static/don't change that often, then make them compiled. Otherwise, leave them uncompiled. If you'd like something really fast (and you know the expressions ahead of time) I can hook you up with my little engine. It's pretty small. It doesn't support anchors or backtracking, but does full UTF-32 Unicode. (You'll need to use the ToUtf32 function to convert UTF-16 to UTF-32.

                    Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

                    J Offline
                    J Offline
                    jschell
                    wrote on last edited by
                    #12

                    honey the codewitch wrote:

                    I can tell you is that algorithmically it is very little work to convert a regular expression into an NFA state machine

                    Does your regular expression language support everything that the C# one does?

                    H 1 Reply Last reply
                    0
                    • J jschell

                      honey the codewitch wrote:

                      I can tell you is that algorithmically it is very little work to convert a regular expression into an NFA state machine

                      Does your regular expression language support everything that the C# one does?

                      H Offline
                      H Offline
                      honey the codewitch
                      wrote on last edited by
                      #13

                      Nope, because if it did it would be much slower. It supports non-backtracking expressions, sans anchors. As such it's maybe 250x faster than Microsoft's in the best case. (see my perf numbers) I'm also working on a code generation feature for it so it will have that since Regex in .NET 7 does. It already compiles, but honestly there's little reason to compile it. You can turn and expression into an array of ints instead and it matches at least as fast in most cases.

                      Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

                      1 Reply Last reply
                      0
                      • P PIEBALDconsult

                        My main thing is that I don't know what the Compiled Option actually does. I have an idea what it might or should do, but then it doesn't seem to make much sense not to include it every time. The "Compiled" option can't instruct the compiler to make IL of the expression at compile time, it has to be at runtime. I can definitely understand that "compiling" can help when a RegEx is used many times, but does it slow down a RegEx which is used only once or a few times? Is there some number of calls where performance matches (for a particular expression)? I also found that (apparently) the RegEx class caches the expressions and I suspect that it must "compile" them as well, otherwise why bother? I just read this: https://stackoverflow.com/questions/513412/how-does-regexoptions-compiled-work[^] Which I don't consider canon, though it may be correct. So Compiled makes it IL at some cost -- great, perfect. But non-Compiled must do something (at lower cost) as well which can be cached -- "turns the expression into a state machine graph" or similar? Might it then be kinda: 0) Take Expression in as text 1) Create state machine graph of the Expression and cache it 2) If requested, compile the state machine graph into IL -- and maybe cache that as well? There is also the question of performance between the static RegEx methods and an instance of a RegEx. Using an instance should at least eliminate the cost of a cache look-up. Part of why I ask is that I have written a number of methods which use Regular Expressions, and some of these get used by CLR Functions in SQL Server. I therefore want them to run as efficiently as they can as they may be called several million times during a query. Also, in your response, I suspect that when you say "runtime" you may mean "on each call", not just "on the first call".

                        J Offline
                        J Offline
                        jochance
                        wrote on last edited by
                        #14

                        Every time I link stack overflow from this day forward: "Which I don't consider canon, though it may be correct." :-D

                        P 1 Reply Last reply
                        0
                        • J jochance

                          Every time I link stack overflow from this day forward: "Which I don't consider canon, though it may be correct." :-D

                          P Offline
                          P Offline
                          PIEBALDconsult
                          wrote on last edited by
                          #15

                          Glad to be of service.

                          1 Reply Last reply
                          0
                          Reply
                          • Reply as topic
                          Log in to reply
                          • Oldest to Newest
                          • Newest to Oldest
                          • Most Votes


                          • Login

                          • Don't have an account? Register

                          • Login or register to search.
                          • First post
                            Last post
                          0
                          • Categories
                          • Recent
                          • Tags
                          • Popular
                          • World
                          • Users
                          • Groups