comparison of two strings - how to get a score
-
Hi folks, I'm looking to find some code, or guidance for producing my own code, which would give me a score (say a percent value) on how similar two strings are. Any ideas are appreciated. Note: I've found, online, something about a Trigram algorithm for the purpose, but couldn't understand it fully. It seemed to be looking more at the meaning of two sentences (i.e. if they contain synonyms), which isn't quite what I'm looking for. As an example, if my comparison sentence is: "I'm looking for a needle in a haystack" then I'd like "I'm looking for needles in a haystack" to give me a better score than "I'm looking for needles in haystacks", but the scores should be pretty close. Thanks R
-
Hi folks, I'm looking to find some code, or guidance for producing my own code, which would give me a score (say a percent value) on how similar two strings are. Any ideas are appreciated. Note: I've found, online, something about a Trigram algorithm for the purpose, but couldn't understand it fully. It seemed to be looking more at the meaning of two sentences (i.e. if they contain synonyms), which isn't quite what I'm looking for. As an example, if my comparison sentence is: "I'm looking for a needle in a haystack" then I'd like "I'm looking for needles in a haystack" to give me a better score than "I'm looking for needles in haystacks", but the scores should be pretty close. Thanks R
Hi folks, Well, there seems to be little interest in the subject. Either that, or I wasn't very clear on what I wanted - or, few people had what to say of instructive on the matter. No matter the case, in case anyone is looking for a solution, here is what I've found: There is, amongst other things, a good article online (a university paper in fact) that is easy to understand, to the point, and free to download from here: http://arxiv.org/abs/cs.DS/0112022 (See PostScript or PDF link at bottom) Anyway, based on this very good document, I've written a C# function to compare two strings, and give a percentage on how close the strings are to one another as such: // Found http://arxiv.org/abs/cs.DS/0112022 // July 6, 2004 // From: qi xiao yang // Date (v1): Fri, 21 Dec 2001 05:58:12 GMT (250kb) // Date (revised v2): Tue, 25 Dec 2001 21:29:22 GMT (294kb) // Faster Algorithm of String Comparison // Authors: Qi Xiao Yang, Sung Sam Yuan, Lu Chun, Li Zhao, Sun Peng private double StringCompareValue(string str1, string str2) { string strX, strY; // smaller and bigger of the strings, respectively double val; // return value int window, wPos; // window width and position double SSNC = 0; // value used to calculate return in the end // get smaller & bigger strings sorted out // also, perform lower case conversion if(str1.Length < str2.Length) { strX = str1.ToLower(); strY = str2.ToLower(); } else { strX = str2.ToLower(); strY = str1.ToLower(); } // initialize window = strX.Length; wPos = 0; // while the window exists and the smallest string exists while(window > 0 && strX.Length > 0) { // while the window doesn't overlap the end of the string while((wPos + window) <= strX.Length) { // get string to compare to string comparer = strX.Substring(wPos, window); // start comparison point at zero int cPos = 0; // loop through second string to get all possible comparisons while(cPos + window <= strY.Length) { // if comparisons match if(comparer == strY.Substring(cPos, window)) { // update SSNC Value SSNC += Math.Pow((2 * window), 2); // remove "used characters" strY = strY.Remove(cPos, window); strX = strX.Remove(wPos, window); // force window position to stay same place (moved forward again at end of loop) wPos--;; // continue search with next window position break; } // move right on lon