Levenstein Distance

berndg

Assume LD(A, S) = a and LD(S, B) = b, what can say about the relation of a and b, and about LD(A, B)? Is LD() transitive? What is a good choice for S? Would the empty string suffice for S? Any constructive thoughts are welcome. Bernd

Daniel Turini

berndg wrote: Would the empty string suffice for S? Not a formal proof (I don't know if this is your homework), but: LD(A, '') == strlen(A) LD(B, '') == strlen(B) LD(A, B) == ? Intuitively, when calculating LD(A, S) or LD(B, S) you lose the information of which edits you made, keeping only the edit count and this information is needed to make some relation of a, b with LD(A,B) Maybe if you represent LD(X) with a number formed by insertions and deletions in a base B (B > strlen(X)), you may be able to find some relationship, but I don't know, it's just a guess. Perl combines all the worst aspects of C and Lisp: a billion different sublanguages in one monolithic executable. It combines the power of C with the readability of PostScript. -- Jamie Zawinski

berndg

Daniel, Thanks for sharing your thoughts. I am pleased to report that I have outgrown homework by some 20 years or so :-) You're right about LD(A, '&apos) - good point. More thinking required. The idea is to pre-compute LD(A,S) for all entries A of a dictionary D. When searching for a phrase B in that dictionary, all that should be needed was to compute LD(S,B) and to present D in descending order of LD(A,S). The idea is to provide a error-tolerant look-up mechanism. Think electronic phone book. As you type the name you're searching, the list presented starts with those records that are most similar to your search phrase, showing identical matches at the top, followed by similar entries with an increasingly alienated nature. More thoughts on LD() are welcome (I find this very interesting), but maybe I am totally on the wrong track with Levenstein distances at all? Bernd

Daniel Turini

Approximate searching isn't feasible with LD, unfortunately, unless the database is very short, as you’ll need to scan the whole list and compute the LD with the input; my company developed for EQUIFAX Brazilian subsidiary some years ago and we have done a lot of research about searching similar names, and there are tons of algorithms for similarity ordering, grouping and so on. Sometimes, you have some algorithms that will give you a candidate list, and you can use LD to further refine it. One of the best algorithms we developed (talk about reinventing the wheel) is similar to this this. Theirs is more sophisticated, but the general idea of "learning" was the same. This paper will also explain several reasons why LD is not a good algorithm for several real-world problems. In the past few years, this field of work has received lots of attention. If you spend some time googling, you’ll find lots of references to protein searches: the GENOMA project needed to find similar proteins on large DNA sequences. A DNA protein is formed by the “letters” A,T,C,G (hence the sci-fi movie name “Gattaca”) and the DNA is a huge sequence of this. For chemical reasons, small differences on ATCG sequences often don’t matter. BTW, the correct name is 'Levenshtein' distance, you're losing a lot of Google hits by searching the wrong name :) Perl combines all the worst aspects of C and Lisp: a billion different sublanguages in one monolithic executable. It combines the power of C with the readability of PostScript. -- Jamie Zawinski

Navin

Don't you mean: LD(A, '') == _tcslen(A) LD(B, '') == _tcslen(B) :-D "I'd be up a piece if I hadn't swallowed my bishop." Mr. Ed, playing chess

Daniel Turini

No, in that case, I would write: LD(A, _T("")) == _tcslen(A) LD(B, _T("")) == _tcslen(B) :-D Perl combines all the worst aspects of C and Lisp: a billion different sublanguages in one monolithic executable. It combines the power of C with the readability of PostScript. -- Jamie Zawinski