GZ wrote:
<snip>
I want a library that does unix 'diff' like function, i.e. compare two
strings line by line and output the difference. Python's difflib does
not work perfectly for me, because the resulting differences are
pretty big. I would like an algorithm that generates the smallest
differences.
<snip>
Thanks for your response.

The verboseness of the format is not really my problem and I only care
about line by line comparison for now.

Let me think of a better way to express what I mean by a "smaller
diff." After I diff the two strings, I will have something like this:

  AAA
- BBB
+ CCC
+ DDD
- EEE

It means the first line does not change, the second line is replaced
by the third line, the forth line is new, and the fifth line is
deleted.

I define the "smallness" of the diff algorithm as "the sum of the
total number of minuses and pluses". In my above example, it is 4 (two
minuses and 2 pluses). Note that no matter what format we use to
represent the diff, this number is the same.

Python's difflib does not really minimize this number. It tries to
make this number small, but also tries to yield matches that “look
right” to people at the cost of increasing this number. (http://
docs.python.org/library/difflib.html).

What I am looking for is an algo that can really minimize this number.

The - lines aren't needed, any more than the context lines are needed, so that will typically cut your results in half. But perhaps the real algorithm you're describing is one that permits lines to be moved as well as inserted and deleted. If you then consider that information to be free (it's not part of the measure you're specifying), you may do best with the following algorithm:

Scan the two files looking for lines that appear exactly once in each file. Consider those lines the first invariants, and everything else a potential change. Now, starting with each pair (which I call a bridge between islands), look at the line previous and the line following, and if either or both are also matched, expand the two islands to include them. As an island grows to butt up against another island, merge them.

Now, each bridge is a "move" operation, and every line left over in either file is either a plus or a minus, an insert or a delete. For most editing transactions, there will be relatively few of these.

For example, if three functions were reordered, from A, B, C, to A, C, B, there would be no plus or minus lines at all.

DaveA

--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to