Jay <jay.srid...@gmail.com> writes:

> I am having an odd problem with difflib.SequenceMatcher. Sample code below:
>
> The strings "src" and "trg" differ only a little.

How exactly? (Please be precise, it helps testing.)

> The SequenceMatcher.ratio() for these strings is 0.0. Many other similar
> strings are working fine without problems (see below) with non-zero
> ratios depending on how much difference there is between strings (as
> expected).

Calling SM(...,trg[1:],src[1:]) gives a plausible result. See also the
result of .get_matching_blocks() on your strings (it returns no matching
blocks).

It is all due to the "autojunk" heuristic (see difflib's doc for
details), which considers the first characters as junk. Call
SM(...,autojunk=False).

I have no idea why the maintainers made this stupid autojunk idea the
default. Complain to them.

-- Alain.

> Tested on Python 2.7 on Ubuntu 14.04
>
> Program follows:
> ---
> from difflib import SequenceMatcher as SM
>
> src = u"N KPT T HS KMNST KNFKXNS AS H KLT FR 0 ALMNXN AF PRFT PRPRT AN
> RRL ARS T P RPLST P KMNS H ASTPLXT HS ANTSTRL KR0 PRKRM NN AS 0 KRT LP
> FRRT 0S PRKRM KLT FR 0 RPT TRNSFRMXN AF XN FRM AN AKRRN AKNM T A SSLST
> ANTSTRL SST"
> trg = u"M KPT T HS KMNST KNFKXNS AS H KLT FR 0 ALMNXN AF PRFT PRPRT AN
> RRL ARS T P RPLST P KMNS H ASTPLXT HS ANTSTRL KR0 PRKRM NN AS 0 KRT LP
> FRRT 0S PRKRM KLT FR 0 RPT TRNSFRMXN AF XN FRM AN AKRRN AKNM T SSLST
> ANTSTRL SST"
> print src, '\n', trg, '\n', SM(None, trg, src).ratio()
--
https://mail.python.org/mailman/listinfo/python-list
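
For anyone who wants to reproduce the effect described in the reply, here
is a minimal sketch. It assumes the wrapped lines of the quoted strings
were originally joined with single spaces, and the helper name
compare_ratios is made up for this illustration. As posted, the two
strings differ in their first character ("N" vs "M") and in an extra "A "
before "SSLST" in src. According to the difflib documentation, autojunk
only applies to sequences of at least 200 items and treats items that
make up more than about 1% of the sequence as junk; with such a small
alphabet, that probably covers every character of src. The autojunk
parameter needs Python 2.7.1 or later (3.2 or later on Python 3).

```python
from difflib import SequenceMatcher

# Strings from the quoted program. Assumption: the line wraps in the
# email stand for single spaces in the original strings.
src = (u"N KPT T HS KMNST KNFKXNS AS H KLT FR 0 ALMNXN AF PRFT PRPRT AN "
       u"RRL ARS T P RPLST P KMNS H ASTPLXT HS ANTSTRL KR0 PRKRM NN AS 0 KRT LP "
       u"FRRT 0S PRKRM KLT FR 0 RPT TRNSFRMXN AF XN FRM AN AKRRN AKNM T A SSLST "
       u"ANTSTRL SST")
trg = (u"M KPT T HS KMNST KNFKXNS AS H KLT FR 0 ALMNXN AF PRFT PRPRT AN "
       u"RRL ARS T P RPLST P KMNS H ASTPLXT HS ANTSTRL KR0 PRKRM NN AS 0 KRT LP "
       u"FRRT 0S PRKRM KLT FR 0 RPT TRNSFRMXN AF XN FRM AN AKRRN AKNM T SSLST "
       u"ANTSTRL SST")


def compare_ratios(a, b):
    """Return (ratio with default autojunk, ratio with autojunk=False)."""
    with_autojunk = SequenceMatcher(None, a, b).ratio()
    without_autojunk = SequenceMatcher(None, a, b, autojunk=False).ratio()
    return with_autojunk, without_autojunk


default_ratio, fixed_ratio = compare_ratios(trg, src)

# Same comparison with the first (differing) character dropped from both
# strings, autojunk left at its default.
shifted_ratio = SequenceMatcher(None, trg[1:], src[1:]).ratio()

print("len(src) = %d, len(trg) = %d" % (len(src), len(trg)))
print("default autojunk        : %.4f" % default_ratio)
print("autojunk=False          : %.4f" % fixed_ratio)
print("default, first char cut : %.4f" % shifted_ratio)
```

Passing autojunk=False costs some speed on long inputs, but for strings
of a few hundred characters that should hardly matter.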