On 15.06.2011 01:08, Johan Corveleyn wrote: > On Tue, Jun 14, 2011 at 5:33 PM, Stefan Sperling <s...@elego.de> wrote: >> On Tue, Jun 14, 2011 at 05:21:27PM +0200, Neels J Hofmeyr wrote: >>> Hi Johan, >>> >>> it's been a while and I still haven't sent you my diff wish we briefly >>> touched on the Subversion hackathon. > Hi Neels, thanks for pursuing this further. > >>> Here is a fabricated example of why I don't like diff to match empty lines: >>> A couple of lines get replaced by completely different ones. By matching the >>> blank line in the middle, it becomes far less readable, IMHO. In my fantasy >>> dream world, this diff would print: >>> >>> [[[ >>> Index: x >>> =================================================================== >>> --- x (revision 1) >>> +++ x (working copy) >>> @@ -4,11 +4,13 @@ >>> >>> void aaa() >>> { >>> - if (x) >>> - do(things); >>> - >>> - if (y) >>> - do(stuff); >>> + while (x || y) >>> + { >>> + check(something); >>> + notify(stuff); >>> + >>> + try(somethingelse); >>> + } >>> >>> bb(b); >>> } >>> ]]] > Yeah, that's certainly a nicer diff for human consumption :-). But > strictly speaking it's a larger diff (more lines marked as +/-), so > that makes it less optimal for the current algorithm. > > The "minimality" criterion of diff (with the LCS) makes it easy to > reason about, and makes for a nice and clear mathematical definition > of the requested diff result. But I agree that it doesn't necessarily > lead to "good-quality" diffs for human readers. > > So: good-quality != minimal, but it's more of a "soft" criterion, > depends on the language, context, ... what lines are important to the > user, ... > > Introducing heuristics in one form or another is probably the only way > to improve this. I'm not an expert in this area myself (I'm actually > more of a theoretical mathematician, so I'm naturally skeptical of > anything without a formal proof :-)). But I have also read some good > things about patience diff, like Stefan suggested ... > >> Do you know about patience diff? >> http://bramcohen.livejournal.com/73318.html >> I think we should try teaching this algorithm to svn diff at some point. >> It's a lot more generic than just checking for empty lines and should >> yield the results you want. > Or if Morten has some great inspiration along similar lines, that > might be equally good or better... > > On Tue, Jun 14, 2011 at 7:53 PM, Morten Kloster <mor...@gmail.com> wrote: >> Actually, I was already planning on making the LCS algorithm estimate >> how statistically significant each matching section is (just a simple >> heuristic, of course, nothing mathematically exact) - I need this for the >> proposals in my recent thread "Improvements to diff3 (merge) >> performance" - and the standard diff could then take an option -noflukes >> (for instance) which would only keep the significant matches. That >> should eliminate (or at least reduce) both problems with blank lines and >> similar issues with braces being matched in unrelated code. >> >> Estimating the significance should be quite quick, so no worries there. >> --- >> Morten > Intuitively, I'd say: let's look into patience diff (or a variation > thereof), because it's already being used in several (D)VCS'es, so it > has already had a lot of exposure. But that's not really a strong > argument :-). Maybe another approach is easier to implement in > libsvn_diff, and yields equally good or even better results ... I > don't know. > > One thing I'm not sure about: suppose we have a really good "heuristic > diff", should we then make it the default (and make --minimal an > option), or should we make it optional (--nice)? I guess we'll see > whenever we get at the point where a heuristic diff is implemented > :-). > > Note that GNU diff uses some heuristics by default (and --minimal is > an option). Maybe GNU diff would already give a "nice" result with > your example with its built-in heuristics?
I'd say not to worry about --minimal and --nice and whatnot. Just make diff output the sanest, nicest diff it can find. I think it's a bad idea to give diff user-visible options that change the output in ways that are hard to explain (shuffling lines around, as opposed to, e.g., using a completely different diff format). -- Brane