On Fri, Apr 01, 2022 at 12:44:24PM +0200, Johan Corveleyn wrote:
> On Tue, Jun 8, 2021 at 5:58 PM Johan Corveleyn <jcor...@gmail.com> wrote:
> > On Tue, Jun 8, 2021 at 3:24 PM Stefan Sperling <s...@stsp.name> wrote:
> > > On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote:
> > > > Okay, I focused on the first revision causing the annotate to
> > > > differ, and with some debug logging:
> > > > - p went up to 139
> > > > - length[0]=1942, length[1]=1817
> > > >
> > > > Now, 1942 lines on the left and 1817 on the right doesn't seem
> > > > all that many (that's what remains after prefix-suffix
> > > > stripping). I see no reason 'svn diff' shouldn't be able to
> > > > calculate a minimal diff for those sizes in a reasonable amount
> > > > of time. So if p can go up to such a "cost" for such
> > > > "reasonable" lengths, I imagine we should put the actual limit
> > > > much higher. I suppose we can just as well set "min_cost" to
> > > > 1024 or even higher. 64 is way too low.
> > >
> > > Thanks!
> > > I have set the value back to at least 1024 with this new version
> > > of the patch.
> >
> > Hmmm, apparently I'm still running into the limit. Even with 8192
> > I'm hitting it at least once. The actual diff there is not really
> > pretty, but it's an actual manual change made by some developer
> > years ago, and the "minimal diff" is more or less readable /
> > understandable. The log message is something like "Group sections
> > related to multimedia module", and it shuffles around a lot of XML
> > sections to group them together.
> >
> > In this case, length[0]==length[1]==46780. Pretty big for the LCS,
> > but not humongous. The value of 'p' goes up to 8279.
> >
> > With the limit set to 16384, I'm not hitting it, and the annotate
> > comes out the same as with the original.
> >
> > Now, I'm a bit puzzled why limit==8192 already takes 18s in your
> > tests. In my situation, with the above diff, I get:
> > limit==8192: 3.3s
> > limit==16384: 3.5s
> >
> > (that's including downloading both versions from the server over
> > https)
> >
> > So I'm wondering why it takes so long in your case. One thing to
> > keep in mind here is that, since this is CPU intensive,
> > optimizations might matter. I remember back in 2011 that I did most
> > of my measurements with a "Release" build, for example. But the
> > above tests I did were with a debug build, so ... dunno.
>
> [ snip part about another optimization by Morten Kloster in r1140247,
> which seemed not to work for Stefan. ]
>
> Coming back to this thread from last year, just wondering if you want
> to pick this up again, Stefan (or anyone else), since a 1.15 release
> might be happening in the near future.
>
> I think we already concluded that changing this behaviour in a patch
> release would be hard to justify. But for 1.15 it might be acceptable
> in some form.
>
> Just wanted to bring this back up; I have no vested interest here.
> I'm still skeptical about changing the output of diff & blame in
> cases where we hit "this takes too much work", but I won't stand in
> the way if that's what others reach consensus on :-). If we want to
> pick a limit on the maximum effort, then I'd suggest 16384 (or
> above), but that's a purely personal / local suggestion, because the
> use case I had above would then not be impacted.
>
> BTW: it's a pity that we can't say: "limit the diff effort in case of
> 'update' or 'merge', but not in case of 'diff' or 'blame'". Because
> in the latter case, the user might care more about the actual output,
> and in the former they might not (until they hit a text conflict of
> course, then they will care again ...)
>
> It might be interesting if someone could take a more "profiling" look
> at why Stefan's example takes much more time, even with limit==8192
> (18 seconds), compared to my example (3.3 seconds). Is that purely
> because of the total size of the files / number of lines? I find that
> difference quite strange. There might be something that can be
> optimized without changing behaviour. But I'm afraid I don't have
> time to dig deeply into this.
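To make the quoted discussion concrete: the "limit" we keep talking
about caps the 'p' counter in the LCS loop, i.e. how much non-matching
work the algorithm will still invest before giving up on a minimal
diff. A self-contained sketch of the idea, in the shape of the
Wu/Manber/Myers O(NP) algorithm (names such as 'max_cost' are made up
for illustration; this is not the actual lcs.c code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
imax(int p, int q)
{
  return p > q ? p : q;
}

/* Furthest-reaching "snake": follow diagonal k while lines match. */
static int
snake(int k, int y, const char **a, int m, const char **b, int n)
{
  int x = y - k;
  while (x < m && y < n && strcmp(a[x], b[y]) == 0)
    {
      x++;
      y++;
    }
  return y;
}

/* Wu/Manber/Myers O(NP) edit distance between the line arrays
 * a (m lines) and b (n lines), requiring m <= n.  Returns the
 * minimal edit distance, or -1 as soon as the deletion count 'p'
 * exceeds max_cost -- the kind of effort cap discussed above. */
static int
capped_edit_distance(const char **a, int m, const char **b, int n,
                     int max_cost)
{
  int delta = n - m;
  int offset = m + 1;
  int size = m + n + 3;
  int *fp = malloc(size * sizeof(int));
  int p, k, result = -1;

  for (k = 0; k < size; k++)
    fp[k] = -1;

  for (p = 0; p <= max_cost; p++)
    {
      for (k = -p; k <= delta - 1; k++)
        fp[k + offset] = snake(k, imax(fp[k - 1 + offset] + 1,
                                       fp[k + 1 + offset]), a, m, b, n);
      for (k = delta + p; k >= delta + 1; k--)
        fp[k + offset] = snake(k, imax(fp[k - 1 + offset] + 1,
                                       fp[k + 1 + offset]), a, m, b, n);
      fp[delta + offset] = snake(delta, imax(fp[delta - 1 + offset] + 1,
                                             fp[delta + 1 + offset]),
                                 a, m, b, n);
      if (fp[delta + offset] >= n)
        {
          result = delta + 2 * p;  /* minimal diff found within budget */
          break;
        }
    }

  free(fp);
  return result;
}

int
main(void)
{
  const char *a[] = { "c", "b", "a", "b", "a", "c" };
  const char *b[] = { "a", "b", "c", "a", "b", "b", "a" };
  printf("edit distance: %d\n", capped_edit_distance(a, 6, b, 7, 64));
  return 0;
}

When this returns -1, the caller has to fall back to producing a
non-minimal diff, which is exactly the output change Johan is
skeptical about.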
My assumption by now is that the algorithm needs to be changed. This
would essentially amount to a complete rewrite of the diff code. There
must be a fundamental difference from how 'git diff' works, for
example, which succeeds on the large files I have tested in a
relatively short amount of time. Even 'got diff' (from my side-project
gameoftrees.org), which does not focus on performance as a first-class
feature, completes a diff of the problematic test case discussed in
this thread in about 3 minutes. It uses a diff implementation written
by Neels, with the Patience algorithm.

I have shelved this topic for now. I have no desire to rewrite
Subversion's diff code. And the original use case is an edge case
where people are versioning large auto-generated XML files for reasons
I don't understand. They worked around it by configuring 'git diff' as
a diff tool for Subversion to use, and have not complained since. I
suspect they could also change their process to work around this
issue, but such changes are not always easy to get pushed through in a
large organization. But, in the end, these are home-made problems
which could be solved by using Subversion as it is intended to be
used.

Cheers,
Stefan
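P.S.: for anyone curious, the core trick of the Patience algorithm is
easy to sketch: only lines which occur exactly once in both files are
considered as match candidates, and a longest increasing subsequence
over their positions selects "anchors" which split the files into
small subproblems that an ordinary LCS can then handle cheaply. A
rough, self-contained illustration (all names made up; this is not
Neels' implementation):

#include <stdio.h>
#include <string.h>

#define MAXL 64  /* sketch only: small fixed bounds */

static int
count(const char *s, const char **arr, int n)
{
  int i, c = 0;
  for (i = 0; i < n; i++)
    if (strcmp(arr[i], s) == 0)
      c++;
  return c;
}

static int
find(const char *s, const char **arr, int n)
{
  int i;
  for (i = 0; i < n; i++)
    if (strcmp(arr[i], s) == 0)
      return i;
  return -1;
}

int
main(void)
{
  /* "x" repeats on both sides, so it can never be an anchor;
   * "gamma" moved, so the LIS step drops it. */
  const char *a[] = { "alpha", "x", "beta", "gamma", "x", "delta" };
  const char *b[] = { "gamma", "x", "alpha", "beta", "x", "delta" };
  int m = 6, n = 6;
  int pa[MAXL], pb[MAXL], np = 0;
  int tails[MAXL], links[MAXL], order[MAXL];
  int i, piles = 0;

  /* Match candidates: lines occurring exactly once in both files.
   * (Quadratic scan for brevity; real code would hash the lines.) */
  for (i = 0; i < m; i++)
    if (count(a[i], a, m) == 1 && count(a[i], b, n) == 1)
      {
        pa[np] = i;
        pb[np] = find(a[i], b, n);
        np++;
      }

  /* Longest increasing subsequence over the b-side positions, using
   * patience piles with binary search (O(np log np)). */
  for (i = 0; i < np; i++)
    {
      int lo = 0, hi = piles;
      while (lo < hi)
        {
          int mid = (lo + hi) / 2;
          if (pb[tails[mid]] < pb[i])
            lo = mid + 1;
          else
            hi = mid;
        }
      links[i] = lo > 0 ? tails[lo - 1] : -1;
      tails[lo] = i;
      if (lo == piles)
        piles++;
    }

  /* Walk back from the top of the last pile to list the anchors. */
  if (piles > 0)
    {
      int k = piles, idx = tails[piles - 1];
      while (idx != -1)
        {
          order[--k] = idx;
          idx = links[idx];
        }
      for (i = 0; i < piles; i++)
        printf("anchor: a[%d] == b[%d]: \"%s\"\n",
               pa[order[i]], pb[order[i]], a[pa[order[i]]]);
    }
  return 0;
}

Because the anchors are restricted to unique lines, big blocks of
shuffled-around sections (like the XML grouping case above) tend to
fall out naturally, without the effort blow-up of a full minimal LCS.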