Tim Peters added the comment:

I'm sympathetic, but I don't see a good solution here that wouldn't require 
backward-incompatible code.

ndiff was built to generate "the highest quality diff possible", for text 
written and edited by humans, where "quality" is measured by human judgment 
(not an abstract mathematical measure).  That was its goal, and I think it does 
well at that.  It wasn't intended to unwind mechanical changes made to 
machine-generated text that reads as gibberish to human eyes.

Not that such a thing has no value, but it was never ndiff's _intent_ to handle 
such stuff.  So the best solution here would be to use a different differencing 
engine.

As already noted, unified_diff skips any attempt to find "merely similar" 
lines, and that's the chief source of expense here:  ndiff takes time 
essentially cubic in the number of lines, given inputs with a great many 
similar (but no identical) lines.  It was noted that kdiff3 does fine on these 
inputs very quickly, but, from the kdiff3 FAQ:  "If similar lines appear next 
to each other, this actually is coincidence but this fortunately is often the 
case."  It just so happens that in the inputs attached to this report, the 
first lines in each file pair are similar, and the second lines likewise, etc 
etc.  This allows kdiff3 to do a wonderful job without any time at all spent 
_trying_ to find similar lines.
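
For concreteness, here's a tiny timing sketch (synthetic stand-ins for the 
inputs attached to this report, not the files themselves) where every line in 
one sequence is similar to - but not identical with - a line in the other:

    import difflib
    import time

    n = 300  # ndiff's cost grows roughly cubically in the line count
    a = ["line %06d with some shared filler text\n" % i for i in range(n)]
    b = ["line %06d with some shared filler text!\n" % i for i in range(n)]

    t0 = time.time()
    list(difflib.unified_diff(a, b))  # no similar-line search
    t1 = time.time()
    list(difflib.ndiff(a, b))         # hunts for most-similar pairs
    t2 = time.time()
    print("unified_diff: %.2fs  ndiff: %.2fs" % (t1 - t0, t2 - t1))

On inputs like these unified_diff finishes in a blink, while ndiff spends 
nearly all its time scoring line pairs against each other.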

But there's nothing coincidental here in what ndiff does:  it finds "_the_ most 
similar pair", and that's a quadratic-time operation (which turns into cubic 
time when repeated over & over).
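
The search amounts to something like this simplified sketch (the real code is 
the Differ class's _fancy_replace helper, which also applies junk filtering 
and ratio cutoffs omitted here):

    from difflib import SequenceMatcher

    def most_similar_pair(a_lines, b_lines):
        # Scores every line in a_lines against every line in b_lines:
        # len(a_lines) * len(b_lines) calls to ratio().
        cruncher = SequenceMatcher(None)
        best_ratio, best_i, best_j = -1.0, None, None
        for j, bj in enumerate(b_lines):
            cruncher.set_seq2(bj)  # seq2 is cached, so vary seq1 inside
            for i, ai in enumerate(a_lines):
                cruncher.set_seq1(ai)
                r = cruncher.ratio()
                if r > best_ratio:
                    best_ratio, best_i, best_j = r, i, j
        return best_i, best_j, best_ratio

Repeating that search on the shrinking sub-blocks to either side of each 
winning pair is what drives the worst case from quadratic up to cubic.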

I didn't write any of the HTML-generating functions, nor have I ever used them, 
so I'm not in a good position to judge what they "should do".  If similar lines 
aren't interesting in this context, then switching to unified_diff would save 
gobs of time.  If similar lines are interesting, then new code would be 
required to get at _some_ notion of that less expensively than ndiff does it 
(e.g., the similarity scores for all pairs could be computed once and saved 
away, reducing worst-case cubic time to worst-case quadratic time, but at the 
cost of needing additional memory quadratic in the number of lines).
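
Just to illustrate the caching idea (a sketch of what such new code might do - 
it's not anything difflib does today):

    from difflib import SequenceMatcher

    def make_ratio_cache(a_lines, b_lines):
        # Each (i, j) score is computed at most once, then reused across
        # the repeated searches; memory is quadratic in the line count.
        cache = {}
        def ratio(i, j):
            if (i, j) not in cache:
                cache[i, j] = SequenceMatcher(None, a_lines[i],
                                              b_lines[j]).ratio()
            return cache[i, j]
        return ratio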

I'm -1 on 11740.patch:  there is no reason, in general, to give up looking for 
the most similar pair - and give up on intraline differencing - simply because 
the total number of lines in the two blocks exceeds 99.  It would reduce diff 
quality for all uses of the Differ class, not just for the specific uses of 
Differ made indirectly (via ndiff) by the make_table function - and it would be 
backward-incompatible anyway.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue6931>
_______________________________________