Wildcard Matching Routine
-
I'm writing a small routine to match strings using simple wildcards, namely "?" and "*". So I have this dilemma: Take the wildcard string "abc*xyz". In your opinion, do you think that string should match only strings that begin with "abc" and end with "xyz", with any number of characters between? Or, should that string match anything that begins with "abc" and the "xyz" at the end is irrelevant because the star matches anything that comes after "abc"? The second case would be easy to code, and I imagine the first case would be more difficult. Which way would YOU expect the matching to work?
The difficult we do right away... ...the impossible takes slightly longer.
-
Coding an ending mask (xyz) and then having it ignored seems illogical. To me, * represents zero, one, or many characters; ? represents exactly one character.
"Before entering on an understanding, I have meditated for a long time, and have foreseen what might happen. It is not genius which reveals to me suddenly, secretly, what I have to say or to do in a circumstance unexpected by other people; it is reflection, it is meditation." - Napoleon I
-
Thanks for your opinion. I tend to agree. Happy holidays!
-
Traditionally, the * wildcard in a regular expression is "zero or more instances of the preceding character", so the regex "abc*def" would match "abcdef", "abccccdef" and even "abdef" (0 c's), but not "abccxccdef". It might even match "Hello abcccdef there", depending on whether the regex is considered to be anchored or not. In a traditional regex the '.' wildcard matches any character. However, if you're writing your own regex parser, you're free to place any meaning on the wildcard characters you want. The functionality you seem to be trying to reproduce looks very much like Unix file globbing. If that's what you're trying to do, and you are on a Unix-like system, you probably have access to glob(3), which does all the heavy lifting for you. There's probably similar functionality for Windows. But maybe you're writing a regex parser for your own purposes, in which case you're free to make up the rules you want. Either way, I'd expect any literals in the regex to be present in any matches found.
Keep Calm and Carry On
-
Thanks for your opinion. I tend to agree. Happy holidays!
You too! And best in the new year.
-
Thank you for your input. I'm actually not going for regular expressions, just simple wildcard matching like in MS-DOS. Happy holidays!
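To make the difference between the two readings of "abc*def" concrete, here is a small sketch in Python. It uses the standard `re` module for the regex reading and `fnmatch.fnmatchcase` as a stand-in for simple wildcard matching. That stand-in is an assumption: `fnmatch` follows Unix shell rules (it also understands `[...]` sets), which is close to, but not identical with, MS-DOS wildcards.

```python
import re
from fnmatch import fnmatchcase

# Regex reading: "c*" means zero or more 'c' characters.
regex = re.compile(r"abc*def")
print(bool(regex.fullmatch("abdef")))                 # True  (zero c's)
print(bool(regex.fullmatch("abccccdef")))             # True
print(bool(regex.fullmatch("abccxccdef")))            # False ('x' is not a 'c')
print(bool(regex.search("Hello abcccdef there")))     # True  (unanchored search)

# Wildcard reading: "*" stands for any run of characters by itself.
print(fnmatchcase("abccxccdef", "abc*def"))           # True
print(fnmatchcase("abdef", "abc*def"))                # False (literal 'c' is required)
```

Under either reading, the literal characters in the pattern have to appear in the matched string; only the meaning of the star changes.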
-
Given a text and a wildcard pattern, implement a wildcard pattern matching algorithm that determines whether the pattern matches the text. The match must cover the entire text, not just part of it.
The wildcard pattern can include the characters ‘?’ and ‘*’:
‘?’ – matches any single character
‘*’ – matches any sequence of characters (including the empty sequence)
-
The string must start with abc and end with xyz. I also think that abc*xyz should match these: abcxyxyz, abcxxyz, abcxyz, and abcxxxxyyyyxyz. Add them to your unit tests. The nice thing about this algorithm is that there is not a lot of state to track. It is very simple to implement in any language.
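As one possible illustration of how little state is needed, here is a minimal sketch in Python of a whole-string matcher for "?" and "*". The function name `wildcard_match` and the backtrack-to-the-last-star approach are my own choices for the example, not something taken from the posts above; a recursive version would work just as well.

```python
def wildcard_match(pattern: str, text: str) -> bool:
    """Return True if pattern matches the whole of text.

    '?' matches exactly one character; '*' matches any run of
    characters, including the empty run.
    """
    p = t = 0      # current positions in pattern and text
    star = -1      # index of the most recent '*' in pattern, or -1
    mark = 0       # text position where that '*' began matching
    while t < len(text):
        if p < len(pattern) and pattern[p] == "*":
            # Remember the star and first try letting it match nothing.
            star, mark = p, t
            p += 1
        elif p < len(pattern) and pattern[p] in ("?", text[t]):
            p += 1
            t += 1
        elif star != -1:
            # Mismatch: let the last star absorb one more character.
            mark += 1
            p, t = star + 1, mark
        else:
            return False
    # Text is used up; only trailing stars may remain in the pattern.
    while p < len(pattern) and pattern[p] == "*":
        p += 1
    return p == len(pattern)


for s in ("abcxyxyz", "abcxxyz", "abcxyz", "abcxxxxyyyyxyz"):
    print(s, wildcard_match("abc*xyz", s))   # each line ends in True
print(wildcard_match("abc*xyz", "abcxyzq"))  # False: does not end with xyz
```

The only state is two positions plus the location of the last star, which is why the same loop ports easily to other languages.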
-
It gets slightly more complex if you want to extend it to a path of (back)slash-separated directory levels, with a directory name '**' indicating zero or more directory levels. I find this so useful that I would recommend that you include it from the very beginning. (Assuming, of course, that your wildcard routine is intended for file names, or similarly structured name strings.)
-
If * matched any character, then it would already match a slash. You would have to make * not match a slash to need ** as another symbol. I have seen and used ** in a lot of utilities, including Ant XML filesets. I have never had to write the logic for it, though.
-
If you write a matching routine for file/path names, you should at least as an option treat the path separators differently. For a general matching routine, the (set of) separator character(s) should be a parameter, so the same routine can be used in different contexts, e.g. different file systems. I guess that writing a match for ** that doesn't use recursion would require more effort and the code would be more difficult to comprehend than to do it recursively. I wouldn't ever consider flattening that recursive matching routine I use in my code. (But of course, like in all recursion, I take care to reduce the stack frame to a minimum.)
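A rough sketch of what such a recursive routine could look like, in Python. Everything here is illustrative: the names `path_match` and `_match_parts`, the `seps` parameter, and the use of `fnmatch.fnmatchcase` for matching within a single level are assumptions made for the example, not the code I actually use. Note that `fnmatchcase` also interprets `[...]` sets.

```python
import re
from fnmatch import fnmatchcase


def path_match(pattern: str, path: str, seps: str = "/") -> bool:
    """Match path against pattern level by level.

    seps is the set of separator characters. Within one level, '*' and '?'
    never cross a separator; a level named '**' stands for zero or more
    whole levels.
    """
    splitter = re.compile("[" + re.escape(seps) + "]")
    return _match_parts(splitter.split(pattern), splitter.split(path))


def _match_parts(pats: list, parts: list) -> bool:
    if not pats:
        return not parts            # pattern used up: path must be too
    head, rest = pats[0], pats[1:]
    if head == "**":
        # Let '**' absorb 0, 1, 2, ... leading levels and recurse.
        return any(_match_parts(rest, parts[i:])
                   for i in range(len(parts) + 1))
    if not parts:
        return False
    return fnmatchcase(parts[0], head) and _match_parts(rest, parts[1:])


print(path_match("src/**/*.c", "src/main.c"))        # True  (zero levels)
print(path_match("src/**/*.c", "src/a/b/main.c"))    # True  (two levels)
print(path_match("src/*.c", "src/a/main.c"))         # False ('*' stays in one level)
print(path_match(r"src\**\*.c", r"src\a\main.c", seps="\\"))  # True
```

The recursion depth is bounded by the number of levels in the pattern, and each frame holds only two short lists, so the stack cost stays small. Patterns with several '**' levels can make this simple version retry a lot; memoising on the two positions would be the usual remedy.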
-
The original post does not mention file system anywhere. I was thinking more in terms of a pure programming exercise. For traversing directory structures, I use a Visitor pattern with the methods beginDir/endDir/foundFile. This allows easy reuse of the recursive algorithm across a dozen utilities that process file trees. It keeps the stack lean for the recursion; any bloat ends up in the Visitor's heap memory. (That includes a few utilities that needed their own stack data structure to do their job.)
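For readers who have not seen that arrangement, here is a minimal sketch of the idea in Python. It is not the code from my utilities: the class names `TreeVisitor` and `FileCounter` and the function `walk` are hypothetical, and the three callbacks are spelled in snake_case (`begin_dir`, `end_dir`, `found_file`) to follow Python convention.

```python
import os


class TreeVisitor:
    """Hypothetical visitor interface; subclasses override what they need."""

    def begin_dir(self, path: str) -> None:
        pass

    def end_dir(self, path: str) -> None:
        pass

    def found_file(self, path: str) -> None:
        pass


def walk(root: str, visitor: TreeVisitor) -> None:
    """Recursive traversal; all per-utility state lives in the visitor."""
    visitor.begin_dir(root)
    with os.scandir(root) as it:
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            walk(entry.path, visitor)
        elif entry.is_file(follow_symlinks=False):
            visitor.found_file(entry.path)
    visitor.end_dir(root)


class FileCounter(TreeVisitor):
    """Example visitor: counts files and tracks the deepest directory level."""

    def __init__(self) -> None:
        self.files = 0
        self.depth = 0
        self.max_depth = 0

    def begin_dir(self, path: str) -> None:
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def end_dir(self, path: str) -> None:
        self.depth -= 1

    def found_file(self, path: str) -> None:
        self.files += 1

# Usage (not run here): counter = FileCounter(); walk("some/dir", counter)
```

The traversal itself carries nothing but the current directory's entries; a visitor that needs its own stack (for example, to total sizes per directory) keeps it as an ordinary attribute and pushes and pops in `begin_dir` and `end_dir`.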
-
englebart wrote:
The original post does not mention file system anywhere.
Sure. That is why I explicitly point out that my comments are valid if the application is a file system directory search. That doesn't mean "if and only if", though. I think my remarks are valid even for other cases where the string somehow expresses a hierarchical structure or classification. It would be equally valid for, say, the military way (at least in Norway) of referring to items in ways such as "pants under long white", with space as the level separator. You may of course argue "But when there is no structure at all, then this talk about separators is irrelevant", which is perfectly true. I just took it one step further, generalising it a little bit so that it also covers those cases where a hierarchical structure is implied. In quite a lot of the cases you encounter in software systems, that is indeed the case!