Hi all, Once in a while I happen to stumble on some not expected behaviour of difflib.SequenceMatcher, in this case I hopefully managed to replicate it on an illustrative sample text. Both strings differ in a minimal way, each having one extra character in a "strategic" position, which seems to meet some pathological case for difflib. Instead of just reporting the insertion and deletion of these single characters (which works well for most cases - with most other positions of the differing characters), the output of the SequenceMatcher decides to delete a large part of the string in between the differences and to insert the almost same text after that. I didn't find any mentions of such cases in the documentation and, honestly, I wasn't able to follow the sourcecode of difflib to make i t clearer, hence I would like to ask for some hints. Can this behaviour be avoided or worked around in some way? (I thought about repeatedly trying sequence matcher on replaced parts, but this doesn't help, if there is an insertion and deletion in the opcodes). Or is this maybe some inherent possibility of the algorithm, which cannot be dealt with reasonably?
The attached code simply prints the results of the comparison with the respective tags, and substrings. No junk function is used. I get the same results on Python 2.5.4, 2.6.5, 3.1.1 on windows XPp SP3. Thanks in advance for any hints, Regards, vbr ############################################################# #! Python # -*- coding: utf-8 -*- import difflib # txt_a - extra character A at index 196 txt_a = "Chapman: *I* don't know - Mr Wentworth just told me to come in here and say that there was trouble at the mill, that's all - I didn't expect a kind of Spanish Inquisition.[jarring chord] Ximinez: ANobody expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again." # txt_b - extra character B at index 525 txt_b = "Chapman: *I* don't know - Mr Wentworth just told me to come in here and say that there was trouble at the mill, that's all - I didn't expect a kind of Spanish Inquisition.[jarring chord] Ximinez: Nobody expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our Bweapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again." seq_match = difflib.SequenceMatcher(None, txt_a, txt_b) print ("\n".join("%7s a[%d:%d] (%s) b[%d:%d] (%s)" % (tag, i1, i2, txt_a[i1:i2], j1, j2, txt_b[j1:j2]) for tag, i1, i2, j1, j2 in seq_match.get_opcodes()))
difflib_test_inq.py
Description: Binary data
-- http://mail.python.org/mailman/listinfo/python-list