Comparing "Similar" Strings
-
I know this is a toughie, but I figured I'd ask...

I am writing an internal application for my company. We have thousands of articles on our web server in HTML (actually ASP) format. These articles are technical in nature and support the various software programs we write. The tool is meant to scan the text of those articles and convert the information into a new format that we'll be storing in a database.

Part of that conversion process is identifying the product and version that each article is associated with. This is really easy when the product name is spelled correctly and formatted the way I expect. Unfortunately, people make mistakes (lots and lots of mistakes), and what I expect is rarely what is there.

For a silly example, let's say that I'm scanning the text for the word "Microsoft" to see if an article is associated with a Microsoft product. Easy enough, right? But as I start looking at articles, I see:

Micro Soft
Microssoft
Microsoff
MS
Microsucks
MSoft
..
..

Well, you get the idea. Doing a simple InStr(Article, "Microsoft") doesn't always find what I want. What I need is a more "fuzzy" compare method. Microsoft SQL Server has the LIKE operator, which is kind of close to this. Better yet, I can use the FREETEXT or CONTAINS full-text predicates to provide a fuzzy search.

Is there a similar function or technique in VB that would allow me to do a compare like this?

-Todd Davis (toddhd@hotmail.com)
-
Not unless you write it. I found a couple of resources on the 'Net about the subject just by searching for 'fuzzy string compare'. You might want to try converting this Delphi source. You also might want to try working something up using regular expressions. I don't have any code, but it's an idea I would look into.

RageInTheMachine9532
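One common way to do a 'fuzzy string compare' is edit distance (Levenshtein distance): count how many single-character inserts, deletes, or substitutions turn one word into another, and accept anything under a small threshold. The sketch below is in Python purely to show the logic; it is not the Delphi source mentioned above, and the function names and the threshold of 2 are assumptions for illustration. The same loops should port to VB without much trouble.

```python
import re

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def fuzzy_contains(text: str, target: str, max_distance: int = 2) -> bool:
    """True if some word, or two adjacent words joined, is close to target.

    max_distance is a hypothetical tuning knob; pick it per product name.
    """
    words = re.findall(r"[A-Za-z]+", text.lower())
    target = target.lower()
    # Joining neighbours lets "Micro Soft" be compared as "microsoft".
    candidates = words + [a + b for a, b in zip(words, words[1:])]
    return any(edit_distance(c, target) <= max_distance for c in candidates)
```

With a threshold of 2 this catches "Micro Soft", "Microssoft" and "Microsoff", but not abbreviations such as "MS" or "MSoft", which are too far from "Microsoft" character by character. Those are probably better handled with an explicit list of known aliases per product.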
-
There is a good article about this on this website: http://www.codeproject.com/string/dmetaphone6.asp
-
Try doing a SOUNDEX search. When people spell a word incorrectly, it usually still sounds the same. SOUNDEX creates a short code (a letter followed by three digits) for a string based on its phonetics, so Smithe and Smythe result in the same SOUNDEX value. I believe SQL Server (and definitely Oracle - ner!) supports it straight out of the box...
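If the comparison has to happen in the application rather than in the database, SOUNDEX is small enough to write by hand. Below is a minimal sketch of the standard American Soundex rules, again in Python only to show the logic for porting to VB. The function name and the choice to return an empty string for input with no letters are my own assumptions, and database implementations may differ in edge cases.

```python
# Consonant groups that share a Soundex digit; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word: str) -> str:
    """Return the 4-character Soundex code of word ("" if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()            # the first letter is kept as-is
    last_code = CODES.get(letters[0], "")  # but its code still suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue                       # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != last_code:
            result += code
        last_code = code                   # a vowel resets, so repeats count again
    return (result + "000")[:4]            # pad or truncate to four characters
```

Here "Smithe" and "Smythe" both come out as S530, and "Microssoft" and "Microsoff" both give M262, the same as "Microsoft". "MSoft" gives M213, so abbreviations still need separate handling.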