On Mon, Jul 23, 2018 at 9:41 PM Jonathan Nieder <[email protected]> wrote:
>
> Hi,
>
> Stefan Beller wrote:
>
> > As a user I wondered what the diff algorithms are about. Offer at least
> > a basic explanation on the differences of the diff algorithms.
>
> Interesting. Let's see.
>
> [...]
> > --- a/Documentation/diff-options.txt
> > +++ b/Documentation/diff-options.txt
> > @@ -94,16 +94,34 @@ diff" algorithm internally.
> > Choose a diff algorithm. The variants are as follows:
> > +
> > --
> > -`default`, `myers`;;
> > - The basic greedy diff algorithm. Currently, this is the default.
> > `minimal`;;
> > + A diff as produced by the basic greedy algorithm described in
>
> Why this reordering?
because Myers (below) is constructed from minimal + heuristic.
Before this patch we say Myers is myers and minimal "spends
extra cycles", which is true if you read the code, as it just is
for (...) {
test_path
if (need_min)
continue;
if (heuristic_good_enough())
break;
}
https://github.com/git/git/blob/master/xdiff/xdiffi.c#L152
So the "minimal" algorithm is the basic myers algorithm,
and the "myers" algorithm is the extended version (extended
with a heuristic to be fast in corner cases, still producing good
enough diffs).
>
> > + link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference
> > Algorithm and its Variations]
>
> This URL doesn't seem likely to stay stable. Are appearances
> deceiving? (Maybe so, since that's the same host as libxdiff's
> homepage <http://www.xmailserver.org/xdiff-lib.html>.) It might be
> worth mentioning the author and date just in case.
I though about it and did not mention it, as the name
"An O(ND) Difference Algorithm and its Variations" is already
enough to find that paper quickly.
> > - Spend extra time to make sure the smallest possible diff is
> > - produced.
> > +`default`, `myers`;;
> > + The same algorithm as `minimal`, extended with a heuristic to
> > + reduce extensive searches. Currently, this is the default.
>
> Amusing --- this means that the Myers diff is spelled as "minimal"
> instead of "myers".
>
> > `patience`;;
> > - Use "patience diff" algorithm when generating patches.
> > + Use "patience diff" algorithm when generating patches. This
> > + matches the longest common subsequence of unique lines on
> > + both sides, recursively. It obtained its name by the way the
> > + longest subsequence is found, as that is a byproduct of the
> > + patience sorting algorithm. If there are no unique lines left
> > + it falls back to `myers`. Empirically this algorithm produces
> > + a more readable output for code, but it does not garantuee
> > + the shortest output.
>
> Probably also worth mentioning that this was invented by Bram Cohen
> for the bazaar vcs.
I tried to balance the part that reads like a history lesson and the
immediately useful part (why is this better, when should I use it?)
Mentioning bazaar probably makes sense. Not sure about the author,
but I'll include him in a reroll of this patch.
> > `histogram`;;
> > - This algorithm extends the patience algorithm to "support
> > - low-occurrence common elements".
> > + This algorithm re-implements the `patience` algorithm with
>
> What does reimplements mean here? Do you mean that it is a variation
> on patience?
It is supposed to be a variation of patience, but as it comes with its
own implementation which I would not fully trust (as it was ported
from JGit, and reading the comments over there, a misunderstanding
of LCS made it possible to have only one midpoint before recursing)
So it is not just taking the patience algorithm and adding support for
"low-occurrence common elements" (what does that even mean?
patience already distinguishes uniques and non-uniques), but
its own implementation, hence it is not bug-for-bug compatible.
> > + "support of low-occurrence common elements" and only picks
> > + one element of the LCS for the recursion. It is often the
>
> Does LCS mean longest common subsequence or longest common substring
> here? Probably best to spell it out to avoid confusion.
When writing the patch I meant to refer to longest common subsequence
from above, but by picking just one element, this algorithm understands it
as longest common string.
>
> > + fastest, but in cornercases (when there are many longest
>
> s/cornercases/corner cases/
>
> > + common subsequences of the same length) it produces bad
> > + results as seen in:
> > +
> > + seq 1 100 >one
> > + echo 99 > two
> > + seq 1 2 98 >>two
> > + git diff --no-index --histogram one two
>
> I wonder if these details (about all the algorithms) should go in a
> separate Diff Algorithms section and this section could focus on
> what use cases warrant using each, referring to that section for
> details. What do you think?
Yeah I thought about putting together a
Documentation/technical/diffs.txt that talk about all kinds of diffing
(basic diff algorithms, options, how to diff trees), but considered
rolling this as a band aid until that document fully materializes.
Thanks,
Stefan