Steven D'Aprano wrote: > [EMAIL PROTECTED] wrote: > >> This algorithm is called soundex. Here is one implementation example. >> >> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213 >> >> here is another: >> http://effbot.org/librarybook/soundex.htm > > > Soundex is *one* particular algorithm for approximate string matching. > It is optimised for matching Anglo-American names (like Smith/Smythe), > and is considered to be quite old and obsolete for all but the most > trivial applications -- or so I'm told. > > Soundex will not match arbitrary changes -- it will match both cat and > cet, but it won't match cat and mat. > > A more sophisticated approximate string matching algorithm will use the > Levenshtein distance. You can find a Useless implementation here: > > http://www.uselesspython.com/download.php?script_id=108 > > > Given a function levenshtein(s1, s2) that returns the distance between > two strings, you could use it for approximate matching like this: > > def approx_matching(strlist, target, dist=1): > """Matches approximately strings in strlist to > a target string. > > Returns a list of strings, where each string > matched is no further than an edit distance of > dist from the target. > """ > found = [] > for s in strlist: > if levenshtein(s, target) <= dist: > found.append(s) > return s
I compute a relative metric based on th every same implementation like this: def relative(a, b): """ Computes a relative distance between two strings. Its in the range (0-1] where 1 means total equality. @type a: string @param a: arg one @type b: string @param b: arg two @rtype: float @return: the distance """ d = levensthein(a,b) longer = float(max((len(a), len(b)))) shorter = float(min((len(a), len(b)))) r = ((longer - d) / longer) * (shorter / longer) return r The idea is that otherwise e.g. "cat" and "hippopothamus" have a l-distance of only 3, which one would consider good at the first look. Regards, Diez -- http://mail.python.org/mailman/listinfo/python-list