On Fri, Apr 01, 2022 at 04:17:45PM +0200, Johan Corveleyn wrote:
> - Perhaps the fundamental diff algorithm in SVN is fine, but it has a
> performance bug / low-level inefficiency? I think that should be
> explored first, because it might fix most of this problem without
> requiring a discussion about the precise output, and without huge
> efforts of rewrites.

I myself find the current SVN diff code very difficult to make sense of.
Maybe someone smarter than me (like Sander Striker?) could figure this out.

> - I think a lot of other diff tools out there have an "effort-limit",
> like the one you proposed (like --speed-large-files, aka
> "use-heuristic" for GNU diff [1], [2], which might be the default
> behaviour, dunno). I'm guessing 'git diff' has one too. So agreed,
> perhaps we need one too. Just not entirely sure how high we should put
> the limit, and what we have to communicate around "diff output might
> have changed / might not be a minimal diff" (and whether we still need
> an option to fallback to "go over the limit").

The only way I found to terminate SVN diff code early leaves a
giant ugly diff chunk at the bottom of the diff.
Other diff implementations do not have this issue. And I don't know
how this issue could be fixed in the current code.

> - Patience diff might also be an option (though it might have its own
> set of problems and peculiar edge cases, I don't know). It might even
> be possible to reuse part of the current code for this (Not sure this
> requires a complete rewrite. Does it still require a LCS algorithm at
> its core, after picking apart the problem space? I haven't studied it
> enough to know how it works internally). In this case too there is the
> back-compat vs behavioral-change discussion that will take time to
> settle (Do we change the default algorithm / output? Do we make it
> optional, or do we already have too manu knobs? ...).

Yes, roughly, Patience diff involves two algorithms, the grouping of
lines along similar-line boundaries performed by Patience, and an
LCS for parts of the files which Patience cannot deal with by itself.

But LCS needs to complete its work before Patience can do its thing,
and vice-versa. For a given input, each algorithm might run multiple
times, switching whenever the current algorithm cannot make any process.
So if LCS still has a fundamental performance issue for some inputs,
adding Patience diff probably won't help. It could help in cases where
LCS is now fed with different inputs as determined by Patience, such
that bad inputs for LCS are avoided. But I don't know if that would
always be the case.

Reply via email to