John Machin wrote:
> On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <[EMAIL PROTECTED]>
> wrote:
>
>>William Park wrote:
>>
>>>How do you compare 2 strings, and determine how much they are "close" to
>>>each other? Eg.
>>>    aqwerty
>>>    qwertyb
>>>are similar to each other, except for first/last char. But, how do I
>>>quantify that?
>>>
>>>I guess you can say for the above 2 strings that
>>>    - at max, 6 chars out of 7 are same sequence --> 85% max
>>>
>>>But, for
>>>    qawerty
>>>    qwerbty
>>>max correlation is
>>>    - 3 chars out of 7 are the same sequence --> 42% max
>>>
>>>(Crossposted to 3 of my favourite newsgroup.)
>>
>>"However you like" is probably the right answer, but one way might be to
>>compare their soundex encoding
>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out
>>percentage difference based on comparing the numeric part.
>
> Fantastic suggestion. Here's a tiny piece of real-life test data:
>
> compare the surnames "Mousaferiadis" and "McPherson".

Fantastic test data set. I know how to pronounce McPherson, but I'd never have guessed that Mousaferiadis sounds like it. I suppose non-Celts probably wouldn't be able to guess how Dalziell, Drumnadrochit, Culzean, Ceilidh, or Concobarh are pronounced either.

I assume you were actually being facetious and making the point that names that don't look the same on paper can have the same soundex encoding. That's obviously countered by the fact that soundex is just a cheap and cheerful way to find names that probably sound similar, and how a name sounds can vary tremendously with ethnicity or accent. It's a reasonable approach to consider given the very loose requirements presented.

    Ed.
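
For anyone who wants to try the two measures discussed in this thread, here is a minimal Python sketch. It assumes the common American Soundex rules (first letter plus three digits, H and W not separating equal codes) and reads "same sequence" in the original question as the longest common contiguous run. The names soundex() and longest_run_ratio() are made up for illustration and are not from any poster's code.

```python
import difflib

# Letter-to-digit table for American Soundex (an assumption about which
# soundex variant is meant; other variants differ in details).
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code (letter + 3 digits) for name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel clears prev, so a repeated code counts again
    return (result + "000")[:4]


def longest_run_ratio(a, b):
    """Length of the longest common contiguous run over the longer length."""
    if not a or not b:
        return 0.0
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / max(len(a), len(b))


if __name__ == "__main__":
    for name in ("Mousaferiadis", "McPherson"):
        print(name, soundex(name))
    for a, b in (("aqwerty", "qwertyb"), ("qawerty", "qwerbty")):
        print(a, b, longest_run_ratio(a, b))
```

Under these rules both surnames come out with the same code, which is exactly the objection raised above, while the run-length ratio reproduces the 6-of-7 and 3-of-7 figures from the original question.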