2011/7/14 Chris Torek <nos...@torek.net>: > In article <mailman.1002.1310591600.1164.python-l...@python.org> > Vlastimil Brom <vlastimil.b...@gmail.com> wrote: >>I'd like to ask about the availability of a text diff library, like >>difflib, which would support the detection of moved text blocks. > > If you allow arbitrary moves, the "minimal edit distance" problem > (string-to-string edit) becomes substantially harder. If you only > allow insert, delete, or in-place-substitute, you have what is > called the "Levenshtein distance" case. If you allow transpositions > you get "Damerau-Levenshtein". These are both solveable with a > dynamic programming algorithm. Once you allow move operations, > though, the problem becomes NP-complete. > > See http://pages.cs.brandeis.edu/~shapird/publications/JDAmoves.pdf > for instance. (They give an algorithm that produces "usually > acceptable" results in polynomial time.) > -- > In-Real-Life: Chris Torek, Wind River Systems > > Thanks for the references and explanation! I do realise the added complexity with taking the moves into account; given that, my current needs and the usually satisfying results obtained easily with difflib, I am not going to try to implement some more complex diffing algorithm. However, it turns out that the mentioned naive approach with just recomparing the text additions and removals may be partially viable - with some luck, i.e. given, the relevant segments are identified as deletions and inserts and isolated by difflib in the first place (and not subsumed under larger changes or split).
For illustration, the rough simplified code is attached (sorry for the style and possible quirks...) Just now the more similar text segments are just collected, it would be also possible to sort them on their similarity ratio; the current approach also allows to highlight potentially multiple moved segments. Comments and suggestions are, of course, welcome, regards, vbr # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #! Python # -*- coding: utf-8 -*- import difflib import itertools def compare_moves(a, b, similarity_threshold=0.6): """ Poor man's text comparison with simple moves check. Compares two strings using difflib and additionally tries to detect moved blocks by comparing similar deleted and inserted segments with each other - given the similarity_threshold. """ seq_matcher = difflib.SequenceMatcher(isjunk=None, a=a, b=b, autojunk=False) diff_raw = [[tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]] for tag, i1, i2, j1, j2 in seq_matcher.get_opcodes()] deleted, inserted = {}, {} for tag, i1, i2, j1, j2 in seq_matcher.get_opcodes(): if tag == 'delete': deleted[(i1, i2)] = [tag, i1, i2, j1, j2, a[i1:i2]] elif tag == 'insert': inserted[(i1, i2)] = [tag, i1, i2, j1, j2, b[j1:j2]] possibly_moved_blocks = [] for deleted_item, inserted_item in itertools.product(deleted.values(), inserted.values()): similarity_ratio = difflib.SequenceMatcher(isjunk=None, a=deleted_item[5], b=inserted_item[5], autojunk=False).ratio() if similarity_ratio >= similarity_threshold: possibly_moved_blocks.append([deleted_item, inserted_item, similarity_ratio]) print diff_raw print possibly_moved_blocks if __name__ == "__main__": compare_moves("abcXYZdeABfghijklmnopABBCq", "ABCDabcdeACfgXYXZhijklmnopq", similarity_threshold = 0.6) # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # output: # [['insert', 0, 0, 0, 4, '', 'ABCD'], ['equal', 0, 3, 4, 7, 'abc', 'abc'], ['delete', 3, 6, 7, 7, 'XYZ', ''], ['equal', 6, 9, 7, 10, 'deA', 'deA'], ['replace', 9, 10, 10, 11, 'B', 'C'], ['equal', 10, 12, 11, 13, 'fg', 'fg'], ['insert', 12, 12, 13, 17, '', 'XYXZ'], ['equal', 12, 21, 17, 26, 'hijklmnop', 'hijklmnop'], ['delete', 21, 25, 26, 26, 'ABBC', ''], ['equal', 25, 26, 26, 27, 'q', 'q']] [[['delete', 21, 25, 26, 26, 'ABBC'], ['insert', 0, 0, 0, 4, 'ABCD'], 0.75], [['delete', 3, 6, 7, 7, 'XYZ'], ['insert', 12, 12, 13, 17, 'XYXZ'], 0.8571428571428571]]
compare_moves.py
Description: Binary data
-- http://mail.python.org/mailman/listinfo/python-list