On Thu, Jun 16, 2011 at 11:41 PM, Stefan Sperling <s...@elego.de> wrote:
> On Thu, Jun 16, 2011 at 10:38:12PM +0200, Johan Corveleyn wrote:
>> Are you sure? Maybe there are different definitions of how patience
>> diff works (it seems to be rather undocumented :-)). I'm not really
>> into the details, but from what I've read on this thread on the git
>> mailinglist [1], I thought it worked recursively, more or less like
>> this:
>>
>> (1) Find the unique lines, and perform an LCS on them.
>>
>> (2) For each in-between section, again perform step (1) (finding
>> unique lines for this section, which may turn up new unique lines,
>> unique to this section).
>>
>> Repeat this recursively, until there are no more unique lines in a
>> section. Then fall back to the regular diff algorithm (a "standard
>> minimal diff" on the remaining section).
>
> As far as I can tell the recursion was dropped later:
>
> http://bramcohen.livejournal.com/73318.html
> "I've previously described it with the ordering a bit different and a
> recursive step at the end..."
Hm, I've read that blog entry, but it's lacking a lot of detail. For
starters, the explanation of "how it works" is incomplete:

[[[
1. Match the first lines of both if they're identical, then match the
second, third, etc. until a pair doesn't match.
2. Match the last lines of both if they're identical, then match the
next to last, second to last, etc. until a pair doesn't match.
3. Find all lines which occur exactly once on both sides, then do
longest common subsequence on those lines, matching them up.
4. Do steps 1-2 on each section between matched lines
]]]

So ... in step 4, after executing steps 1-2 on the remaining section,
then what :-)? I'm guessing he means that you should then run a
standard LCS on the remaining non-matching section. Or does he want to
consider the entire part, between the points where 1 and 2 stopped, a
mismatch???

Also, the explanation of why he dropped the "recursive step at the
end" is very hand-wavy:

"... and the performance hit of doing the recursion isn't worth it,
because it rarely if ever finds any more matches, and even when it
does it isn't clear whether the extra matches produce a functionally
superior diff."

Hmmmm, what does that mean? Does it mean that if you zoom in on a
section-between-unique-matches, it's usually of such a form that a
patience diff doesn't produce a better result than a standard diff?
Strange. Also, I'm not sure this would be such a big performance hit:
with the token counting currently in place in libsvn_diff, it's not
much effort to recount "token_counts" for a subsection of lines.

Note that steps 1-2 are more or less the prefix/suffix scanning we
already do. Though the way it's implemented now in libsvn_diff, it's
not suited for recursive use like this, because it's low-level and
byte-based, happening before the tokens/lines are extracted. But it
would be pretty trivial to implement something similar with sets of
tokens/lines (see the sketch at the bottom of this mail).

Also, I think if you follow the above algorithm strictly, you'll end
up with ugly diffs, similar to the example Bram Cohen gives in this
same article. Those prefix/suffix-scanning steps blindly consider
identical lines part of the common prefix/suffix, which is not always
what you want (avoiding exactly that is the entire reason patience
diff exists in the first place). Blind prefix/suffix scanning should
keep some identical lines available, to leave some wiggle room for the
main algorithm. That's why we have "#define SUFFIX_LINES_TO_KEEP 50"
in diff_file.c, which makes sure the suffix isn't gobbled up
completely. (With the current LCS algorithm there is no need to do
this during prefix scanning, because the LCS algorithm will likewise
always consider lines common as long as it can while scanning
forward.)

Anywayzzz ... we'll see when we get there I suppose :-).
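Just to make that more concrete, here's a rough sketch of the
recursive variant as I picture it. Everything in it is invented for
illustration: it works on plain arrays of interned line ids (equal id
means equal line) instead of libsvn_diff's real token machinery, it
uses a naive quadratic unique-line scan where the real thing would
recount token counts per section, emit_match() and standard_diff() are
hypothetical stand-ins, and NULL checks / error handling are omitted:

[[[
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins: emit one matched pair of lines (absolute
   indexes), and the standard minimal diff used as a fallback. */
extern void emit_match(size_t a_idx, size_t b_idx);
extern void standard_diff(const int *a, size_t a_len, size_t a_off,
                          const int *b, size_t b_len, size_t b_off);

/* Count occurrences of ID in LINES[0..LEN). */
static size_t
count_id(const int *lines, size_t len, int id)
{
  size_t i, n = 0;
  for (i = 0; i < len; i++)
    if (lines[i] == id)
      n++;
  return n;
}

static void
patience_diff(const int *a, size_t a_len, size_t a_off,
              const int *b, size_t b_len, size_t b_off)
{
  size_t i, j, n = 0;
  size_t *ua, *ub;    /* paired positions of mutually unique lines */
  size_t *dp, *prev;  /* longest-chain DP state */
  size_t best = 0, best_i = 0;

  /* Step 1: match the common prefix.  (A real version would do the
     suffix too, but keep some lines unmatched as wiggle room, in the
     spirit of SUFFIX_LINES_TO_KEEP in diff_file.c.) */
  while (a_len > 0 && b_len > 0 && a[0] == b[0])
    {
      emit_match(a_off, b_off);
      a++; b++; a_off++; b_off++; a_len--; b_len--;
    }

  /* Step 3: collect lines occurring exactly once on both sides, in
     the order they appear in A (quadratic here, for brevity). */
  ua = malloc(a_len * sizeof(*ua));
  ub = malloc(a_len * sizeof(*ub));
  for (i = 0; i < a_len; i++)
    if (count_id(a, a_len, a[i]) == 1 && count_id(b, b_len, a[i]) == 1)
      {
        for (j = 0; b[j] != a[i]; j++)  /* exactly one hit, so safe */
          ;
        ua[n] = i; ub[n] = j; n++;
      }

  if (n == 0)
    {
      /* No unique lines left: fall back to the standard diff. */
      standard_diff(a, a_len, a_off, b, b_len, b_off);
      free(ua); free(ub);
      return;
    }

  /* LCS on the unique pairs == longest chain increasing in UB
     (UA is increasing by construction).  Simple O(n^2) DP. */
  dp = malloc(n * sizeof(*dp));
  prev = malloc(n * sizeof(*prev));
  for (i = 0; i < n; i++)
    {
      dp[i] = 1;
      prev[i] = n;  /* sentinel: no predecessor */
      for (j = 0; j < i; j++)
        if (ub[j] < ub[i] && dp[j] + 1 > dp[i])
          { dp[i] = dp[j] + 1; prev[i] = j; }
      if (dp[i] > best)
        { best = dp[i]; best_i = i; }
    }

  /* Step 4, with the dropped recursion restored: emit the matched
     unique lines and run the whole algorithm again on each section
     in between, so lines unique only to a section can still match. */
  {
    size_t *chain = malloc(best * sizeof(*chain));
    size_t k = best, cur = best_i, pa = 0, pb = 0;
    while (k-- > 0)
      { chain[k] = cur; cur = prev[cur]; }
    for (k = 0; k < best; k++)
      {
        size_t ma = ua[chain[k]], mb = ub[chain[k]];
        patience_diff(a + pa, ma - pa, a_off + pa,
                      b + pb, mb - pb, b_off + pb);
        emit_match(a_off + ma, b_off + mb);
        pa = ma + 1; pb = mb + 1;
      }
    patience_diff(a + pa, a_len - pa, a_off + pa,
                  b + pb, b_len - pb, b_off + pb);
    free(chain);
  }
  free(ua); free(ub); free(dp); free(prev);
}
]]]

The interesting part is at the bottom: every section between two
matched unique lines goes through the same function again, so it gets
its own prefix scan and its own unique-line matching, and only
sections with no unique lines at all fall through to the standard
diff. That's exactly the recursive step that was dropped.

-- 
Johan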