Complex String Comparison
-
What I need to do is compare a series of strings (actually filenames) and find potential duplicates. The strings, however, are unlikely to contain literal duplicates so what I need to do is find similar strings. Has anyone ever had experience of this before? Cheers James
-
Yes, this is a common need in credit applications (finding mistyped names). In your case, I suggest the Levenshtein string distance (also called edit distance) algorithm. You can find tons of C++ implementations on Google.
lazy isn't my middle name.. it's my first.. people just keep calling me Mel cause that's what they put on my driver's license. - Mel Feik
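For anyone who wants a starting point, here is a minimal sketch of the Levenshtein distance in C++. The function name `levenshtein` is just an illustrative choice, not taken from any particular library, and the sketch compares raw bytes, so it assumes plain single-byte text.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein (edit) distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn `a` into `b`.
// Hypothetical helper name; compares bytes, not Unicode code points.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    // Two rolling rows of the dynamic-programming table.
    // prev[j] = distance between the first (i - 1) chars of a and the first j chars of b.
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) {
        prev[j] = j;  // turning an empty prefix into j chars takes j insertions
    }
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // turning i chars into an empty prefix takes i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

A small distance relative to the string length suggests a likely typo: for example, `levenshtein("Jon Smith", "John Smith")` is 1. Where to put the cut-off is a judgment call that depends on your data.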
-
Thanks for that Daniel. I took a look at it and that solves part of my problem. The other part of the problem is that I need to find duplicates based on whether a string has some elements of another. For example, I would want to match the following filenames:
c:\Music\Albums\Vines\01 - Highly Evolved.mp3
c:\Music\Singles\The Vines - Highly Evolved.mp3
f:\Stuff\Vines - Highly Evolved.wma
Do you know if there is a standard way to do this? I don't think there is but I just wanted to make sure. Cheers James
-
If you are really into informatics as a science, some algorithms from bioinformatics may help you. Finding substrings shared between two sequences and computing a similarity distance is quite common in DNA-sequence handling. You will find more material about that on the net than you could read in the rest of your life.
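As one rough illustration of the substring idea (not a standard method for this, and much simpler than real sequence-alignment algorithms), the sketch below strips the directory and extension from each filename, lowercases what is left, and scores two names by their longest common substring. The helper names `fileStem`, `longestCommonSubstring`, and `similarity` are made up for this example, and the scoring formula is an assumption rather than an established measure.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: drop the directory part and the extension, then lowercase.
std::string fileStem(const std::string& path) {
    const std::size_t slash = path.find_last_of("\\/");
    std::string name = (slash == std::string::npos) ? path : path.substr(slash + 1);
    const std::size_t dot = name.find_last_of('.');
    if (dot != std::string::npos && dot != 0) {
        name.erase(dot);  // remove ".mp3", ".wma", and so on
    }
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return name;
}

// Length of the longest run of characters that appears contiguously in both strings.
std::size_t longestCommonSubstring(const std::string& a, const std::string& b) {
    // prev[j] = length of the common run ending at a[i - 2] and b[j - 1].
    std::vector<std::size_t> prev(b.size() + 1, 0), curr(b.size() + 1, 0);
    std::size_t best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            if (a[i - 1] == b[j - 1]) {
                curr[j] = prev[j - 1] + 1;  // extend the run ending one char earlier
                best = std::max(best, curr[j]);
            } else {
                curr[j] = 0;  // mismatch breaks the run
            }
        }
        std::swap(prev, curr);
    }
    return best;
}

// Assumed score in [0, 1]: common-run length divided by the shorter stem's length.
double similarity(const std::string& pathA, const std::string& pathB) {
    const std::string a = fileStem(pathA);
    const std::string b = fileStem(pathB);
    if (a.empty() || b.empty()) {
        return 0.0;
    }
    return static_cast<double>(longestCommonSubstring(a, b)) /
           static_cast<double>(std::min(a.size(), b.size()));
}
```

With the filenames from the post above, the stems "01 - highly evolved" and "the vines - highly evolved" share the 17-character run " - highly evolved", which gives a score of 17/19. Scores near 1 could be flagged as candidate duplicates for manual review; the threshold would need tuning on real data.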