On Nov 24, 8:43 pm, Peter Otten <__pete...@web.de> wrote: > John Yeung wrote: > > I'm generally pleased with difflib.SequenceMatcher: It's probably not > > the best available string matcher out there, but it's in the standard > > library and I've seen worse in the wild. One thing that kind of > > bothers me is that it's sensitive to which argument you pick as "seq1" > > and which you pick as "seq2": > > > Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit > > (Intel)] on > > win32 > > Type "help", "copyright", "credits" or "license" for more information. > >>>> import difflib > >>>> difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio() > > 0.44444444444444442 > >>>> difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio() > > 0.66666666666666663 > > > Is this a bug? I am guessing the algorithm is implemented correctly, > > and that it's just an inherent property of the algorithm used. It's > > certainly not what I'd call a desirably property. Are there any > > simple adjustments that can be made without sacrificing (too much) > > performance? > > def symmetric_ratio(a, b, S=difflib.SequenceMatcher): > return (S(None, a, b).ratio() + S(None, b, a).ratio())/2.0 > > I'm expecting 50% performance loss ;) > > Seriously, have you tried to calculate the ratio with realistic data? > Without looking into the source I would expect the two ratios to get more > similar. > > Peter
Surnames are extremely realistic data. The OP should consider using Levenshtein distance, which is "symmetric". A good (non-naive) implementation should be much faster than difflib. ratio = 1.0 - levenshtein(a, b) / float(max(len(a), len(b))) -- http://mail.python.org/mailman/listinfo/python-list