Inverse regex matching
-
This one is tricky, that's for sure. I'm looking for a way to generate strings based on a regular expression. I tried several searches on google and the only stuff that turned up was a library called Grail+ and some perl module genex. Before I'm going to hack my way into integrating perl into my C# application I want to make sure that I haven't missed a library that does what I need. Is there anyone who has done this before or knows a library which can help me with this?
WM. What about weapons of mass-construction? "What? Its an Apple MacBook Pro. They are sexy!" - Paul Watson My blog
-
This one is tricky, that's for sure. I'm looking for a way to generate strings based on a regular expression. I tried several searches on google and the only stuff that turned up was a library called Grail+ and some perl module genex. Before I'm going to hack my way into integrating perl into my C# application I want to make sure that I haven't missed a library that does what I need. Is there anyone who has done this before or knows a library which can help me with this?
WM. What about weapons of mass-construction? "What? Its an Apple MacBook Pro. They are sexy!" - Paul Watson My blog
Are you just trying to generate random strings that match the regex? What could this possibly be used for? I doubt such functionality is already in the .Net framework, but I am not certain. You could create your own state machine based on a regex, then getting a random string is done trivially by randomly choosing paths from one state to another until you arrive at one of the success states (unless you want to bound or specify the resulting string's length). Anyway, good luck, and let us know if you find a solution elsewhere.
Sounds like somebody's got a case of the Mondays -Jeff
-
Are you just trying to generate random strings that match the regex? What could this possibly be used for? I doubt such functionality is already in the .Net framework, but I am not certain. You could create your own state machine based on a regex, then getting a random string is done trivially by randomly choosing paths from one state to another until you arrive at one of the success states (unless you want to bound or specify the resulting string's length). Anyway, good luck, and let us know if you find a solution elsewhere.
Sounds like somebody's got a case of the Mondays -Jeff
Indeed, I want to generate random strings that match the regex. I'm generating random data for a database which has some fields that need to match to a specific pattern. I'm pretty sure .NET doesn't have the functionality I'm looking for. I think creating my own state machine is indeed the best way to go around this, however it's not going to be easy and there is a problem with the max length of strings. I need to find a way to limit the length of the data generated by the * and + operator otherwise this is impossible to do. Thanks nonetheless
WM. What about weapons of mass-construction? "What? Its an Apple MacBook Pro. They are sexy!" - Paul Watson My blog
-
Indeed, I want to generate random strings that match the regex. I'm generating random data for a database which has some fields that need to match to a specific pattern. I'm pretty sure .NET doesn't have the functionality I'm looking for. I think creating my own state machine is indeed the best way to go around this, however it's not going to be easy and there is a problem with the max length of strings. I need to find a way to limit the length of the data generated by the * and + operator otherwise this is impossible to do. Thanks nonetheless
WM. What about weapons of mass-construction? "What? Its an Apple MacBook Pro. They are sexy!" - Paul Watson My blog
Here is how I would implement it, if I were going to do what you describe (where 'n' is the desired length of the result):
1. Build a directed graph of your regex with nodes==state and edges==allowed state transitions
2. Build a structure containing the number of steps that can be made from
any one state to any other that is <= the length of your result (2d matrix of lists)
3. Randomly choose an acceptance state (named 'as') where your connectivity matrix
indicates you can go from the start state to the final state in exactly n transitions.
4. Iterate through the following (initialize cur = startStateNode) until your string is found:
a. Set 'n' to 'n-1'
b. Out of all of the nodes reachable from 'cur' in one move,
eliminate those that can not reach 'as' in 'n' moves
c. Choose one of the resulting nodes in the collection found in (b)Once you finish with this process, you will have a randomly generated string matching your regex. Note, however, that strings produced do NOT have equal probability of being chosen. For example, if your regex is, say "(Becky|a\w*)", then "Becky" will result half of the time, whereas words beginning with the letter "a" (this set is clearly much larger) would take the other half. Choosing one at random with equal probability would require you to find ALL matching strings of your given length first, then choosing one of those strings at random (or, you could use graph theory and weight each choice by the number of possible ways to arrive at the destination in the required number of moves). Whatever you decide, I hope you can see that this is clearly not a trivial computation, and it will require a significant (IMO) source of man-hours to code. Good luck,
Sounds like somebody's got a case of the Mondays -Jeff
-
Here is how I would implement it, if I were going to do what you describe (where 'n' is the desired length of the result):
1. Build a directed graph of your regex with nodes==state and edges==allowed state transitions
2. Build a structure containing the number of steps that can be made from
any one state to any other that is <= the length of your result (2d matrix of lists)
3. Randomly choose an acceptance state (named 'as') where your connectivity matrix
indicates you can go from the start state to the final state in exactly n transitions.
4. Iterate through the following (initialize cur = startStateNode) until your string is found:
a. Set 'n' to 'n-1'
b. Out of all of the nodes reachable from 'cur' in one move,
eliminate those that can not reach 'as' in 'n' moves
c. Choose one of the resulting nodes in the collection found in (b)Once you finish with this process, you will have a randomly generated string matching your regex. Note, however, that strings produced do NOT have equal probability of being chosen. For example, if your regex is, say "(Becky|a\w*)", then "Becky" will result half of the time, whereas words beginning with the letter "a" (this set is clearly much larger) would take the other half. Choosing one at random with equal probability would require you to find ALL matching strings of your given length first, then choosing one of those strings at random (or, you could use graph theory and weight each choice by the number of possible ways to arrive at the destination in the required number of moves). Whatever you decide, I hope you can see that this is clearly not a trivial computation, and it will require a significant (IMO) source of man-hours to code. Good luck,
Sounds like somebody's got a case of the Mondays -Jeff
Thanks for the outline, it got me thinking about how regex parsers are actually build. I found some old material on how to write your own regex parser and I'm going to use that for my library. My plan is to implement a partial regex parser. This parser will be constructing an NFA of the regex expression and convert it to a DFA. After that I'm going to determine which parts of the DFA will have a fixed length, those parts will need to be generated first. When these parts are generated I will be filling up the rest of the string with the other variable length parts, starting with the operator + (You need at least one if these) and after that the * operator.
WM. What about weapons of mass-construction? "What? Its an Apple MacBook Pro. They are sexy!" - Paul Watson My blog