On Thu, 19 May 2005 06:38:59 +1000, John Machin <[EMAIL PROTECTED]> wrote:
> On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <[EMAIL PROTECTED]> > wrote: > >>William Park wrote: >> >>> How do you compare 2 strings, and determine how much they are "close" to >>> each other? Eg. >>> aqwerty >>> qwertyb >>> are similar to each other, except for first/last char. But, how do I >>> quantify that? >> >>"However you like" is probably the right answer, but one way might be to >>compare their soundex encoding >>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out >>percentage difference based on comparing the numeric part. > > Fantastic suggestion. Here's a tiny piece of real-life test data: > > compare the surnames "Mousaferiadis" and "McPherson". I remember a word processor, MultiMate, which used Soundex to do matching for its suggestions for spelling correction. One of my cow-orkers typed the word 'agains' -- it was supposed to be 'against' but 'again' would also have been a sensible suggestion. MultiMate, however, suggested neither of those reasonable words, it did suggest "iguanas" amd "Utahns"... (I wonder what it does with "Talliafero" and "Tolliver", and with "Featherstone-Howe" and "Fanshaw"...) The answer to the OP, which someone just pointed out to me on comp.programming, is "edit distance" or "Levenshtein Distance"[1]. A google search on either produces some good descriptions in the first few links, including http://www.merriampark.com/ld.htm which has not only a description of the algorithm but also source code in Java, C++ and Visual Basic (no Awk, thought there are links to pages with implementations in Perl, Python, Delphi and many more)... [1] I would have spelt it 'Levenstein', and pronounced it 'Levenshtein' in Schwaebisch (south German) fashion, but apparently the author of the algorithm was one Vladimir I. Levenshtein and that is how he is credited on the IEEE site. There are also a number of Google hits under the 'stein' spelling, though... Chris C -- http://mail.python.org/mailman/listinfo/python-list