Neal Becker wrote:

> difflib.get_close_matches looks useful. But, I don't see where it defines
> 'close'. Besides that, wouldn't it be much more useful if one could
> supply their own distance metric?
If you have a distance function you can find the N best matches with

>>> from heapq import nsmallest
>>> from functools import partial
>>> from Levenshtein import distance
>>> possibilities = ["ape", "apple", "peach", "puppy"]
>>> nsmallest(3, possibilities, key=partial(distance, "appel"))
['ape', 'apple', 'puppy']

With a cutoff it gets a bit messier...

>>> pairs = ((distance("appel", v), v) for v in possibilities)
>>> pairs = ((score, v) for score, v in pairs if score <= 2)
>>> [v for score, v in nsmallest(3, pairs)]
['ape', 'apple']

so you would want to wrap it in a function. But if you have a look into
difflib.get_close_matches()...

def get_close_matches(word, possibilities, n=3, cutoff=0.6):
    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for x in possibilities:
        s.set_seq1(x)
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), x))

    # Move the best scorers to head of list
    result = heapq.nlargest(n, result)
    # Strip scores for the best n matches
    return [x for score, x in result]

...there is a lot of stuff that only makes sense if you use a
SequenceMatcher to calculate the similarity. For a generalized version you
would probably have to throw out the range check for the cutoff and the
optimizations. I don't think it's worthwhile.

Peter
--
http://mail.python.org/mailman/listinfo/python-list
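For what it's worth, here is one way the two snippets above could be rolled
into a single function. The name get_close_matches_by is made up, and the
pure-Python levenshtein() is just a stand-in so the example runs without the
third-party Levenshtein extension:

```python
from heapq import nsmallest

def levenshtein(a, b):
    """Plain dynamic-programming edit distance; a slow stand-in for
    Levenshtein.distance when the C extension is not installed."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def get_close_matches_by(word, possibilities, n=3, cutoff=None,
                         distance=levenshtein):
    """Like difflib.get_close_matches(), but with a pluggable distance
    metric. Lower scores are better; cutoff, if given, is the maximum
    acceptable distance rather than a minimum similarity ratio."""
    pairs = ((distance(word, x), x) for x in possibilities)
    if cutoff is not None:
        pairs = (p for p in pairs if p[0] <= cutoff)
    # nsmallest sorts by (score, word), so ties come back alphabetically
    return [x for score, x in nsmallest(n, pairs)]
```

E.g. get_close_matches_by("appel", ["ape", "apple", "peach", "puppy"],
cutoff=2) gives ['ape', 'apple'], matching the generator-expression version
above.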