From: Vlastimil Brom <vlastimil.b...@gmail.com>
Date: 2010/4/16
Subject: unexpected output from difflib.SequenceMatcher
... Instead of just reporting the insertion and deletion of these single characters ... the SequenceMatcher output deletes a large part of the string between the differences and then inserts almost the same text after it. ...

Just for the record: although it seemed unlikely to me at first, it turns out that this may have the same cause as several difflib items in the issue tracker regarding unexpected output for long sequences with relatively highly repetitive items, e.g.

http://bugs.python.org/issue2986
http://bugs.python.org/issue1711800
http://bugs.python.org/issue4622
http://bugs.python.org/issue1528074

In my case, disabling the "popular" heuristic as mentioned in
http://bugs.python.org/issue1528074#msg29269
i.e. modifying the difflib source (around line 314 for Python 2.5.4) to

    if 0: # disable popular heuristics
        if n >= 200 and len(indices) * 100 > n:
            populardict[elt] = 1
            del indices[:]

seems to work perfectly.

Anyway, I would appreciate comments on whether this is an appropriate solution for the given task, i.e. the character-wise comparison of strings, or whether there are drawbacks to be aware of.
Wouldn't some kind of control over the "popular" heuristic be useful in the exposed interface of difflib?
Or is this simply the wrong tool for character-wise string comparison, as is suggested e.g. in
http://bugs.python.org/issue1528074#msg29273
although it seems to work just right for the most part?

regards,
vbr
--
http://mail.python.org/mailman/listinfo/python-list
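
A note for readers on later Python versions: the kind of control asked about above is available as the autojunk keyword argument of difflib.SequenceMatcher (added in Python 2.7.1 and 3.2), so patching the difflib source should not be needed there. The following is only a minimal sketch of the effect. The test strings and the helper name changed_ops are made up for illustration and are not taken from the original report.

```python
import difflib

# Hypothetical test data (not from the original report): two long strings
# that differ by a single inserted character. Every other character occurs
# far more often than 1% of the time in a sequence of 200+ items, so the
# "popular" heuristic treats them all as popular.
prefix = "ab" * 60            # 120 characters
suffix = "cd" * 60            # 120 characters
a = prefix + suffix
b = prefix + "X" + suffix     # one character inserted at position 120


def changed_ops(a, b, autojunk):
    """Return only the non-'equal' opcodes for a character-wise comparison."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=autojunk)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Heuristic enabled (the default): the whole tail is reported as replaced.
print(changed_ops(a, b, autojunk=True))
# -> [('replace', 120, 240, 120, 241)]

# Heuristic disabled: only the single inserted character is reported.
print(changed_ops(a, b, autojunk=False))
# -> [('insert', 120, 120, 120, 121)]
```

With the heuristic disabled, matching time can grow noticeably on long, repetitive inputs, which is presumably the drawback to keep in mind for character-wise comparison of large strings.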