On Fri, Apr 01, 2022 at 12:44:24PM +0200, Johan Corveleyn wrote:
> On Tue, Jun 8, 2021 at 5:58 PM Johan Corveleyn <jcor...@gmail.com> wrote:
> > On Tue, Jun 8, 2021 at 3:24 PM Stefan Sperling <s...@stsp.name> wrote:
> > > On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote:
> > > > Okay, I focused on the first revision causing the annotate to
> > > > differ, and with some debug logging:
> > > > - p went up to 139
> > > > - length[0]=1942, length[1]=1817
> > > >
> > > > Now, 1942 lines on the left and 1817 on the right doesn't seem
> > > > all that many (that's what remains after prefix-suffix
> > > > stripping). I see no reason 'svn diff' shouldn't be able to
> > > > calculate a minimal diff for those sizes in a reasonable amount
> > > > of time. So if p can go up to such a "cost" for such
> > > > "reasonable" lengths, I imagine we should put the actual limit
> > > > much higher. I suppose we can just as well set "min_cost" to
> > > > 1024 or even higher. 64 is way too low.
> > >
> > > Thanks!
> > > I have set the value back to at least 1024 with this new version
> > > of the patch.
> >
> > Hmmm, apparently I'm still running into the limit. Even with 8192
> > I'm hitting it at least once. The actual diff there is not really
> > pretty, but it's an actual manual change made by some developer
> > years ago, and the "minimal diff" is more or less readable /
> > understandable. The log message is something like "Group sections
> > related to multimedia module", and it shuffles around a lot of XML
> > sections to group them together.
> >
> > In this case, length[0]==length[1]==46780. Pretty big for the LCS,
> > but not humongous. The value of 'p' goes up to 8279.
> >
> > With the limit set to 16384, I'm not hitting it, and the annotate
> > comes out the same as with the original.
> >
> > Now, I'm a bit puzzled why limit==8192 already takes 18s in your
> > tests. In my situation, with the above diff, I get:
> > limit==8192: 3.3s
> > limit==16384: 3.5s
> >
> > (that's including downloading both versions from the server over
> > https)
> >
> > So I'm wondering why it takes so long in your case. One thing to
> > keep in mind here is that, since this is CPU intensive,
> > optimizations might matter. I remember back in 2011 that I did most
> > of my measurements with a "Release" build, for example. But the
> > above tests I did were with a debug build, so ... dunno.
>
> [ snip part about another optimization by Morten Kloster in r1140247,
> which seemed not to work for Stefan. ]
>
> Coming back to this thread from last year, just wondering if you want
> to pick this up again, Stefan (or anyone else), since a 1.15 release
> might be happening in the near future.
>
> I think we already concluded that changing this behaviour in a patch
> release would be hard to justify. But for 1.15 it might be acceptable
> in some form.
>
> Just wanted to bring this back up; I have no vested interest here.
> I'm still skeptical about changing the output of diff & blame in
> cases where we hit "this takes too much work", but I won't stand in
> the way if that's what others reach consensus on :-). If we want to
> pick a limit on the maximum effort, then I'd suggest 16384 (or
> above), but that's a purely personal / local suggestion, because the
> use case I had above would then not be impacted.
>
> BTW: it's a pity that we can't say: "limit the diff effort in case of
> 'update' or 'merge', but not in case of 'diff' or 'blame'". Because
> in the latter case, the user might care more about the actual output,
> and in the former they might not (until they hit a text conflict of
> course, then they will care again ...)
>
> It might be interesting if someone could take a more "profiling" look
> at why Stefan's example takes much more time, even with limit==8192
> (18 seconds), compared to my example (3.3 seconds). Is that purely
> because of the total size of the files / number of lines? I find that
> difference quite strange. There might be something that can be
> optimized without changing behaviour. But I'm afraid I don't have
> time to dig deeply into this.
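To make the quoted discussion concrete: the "limit" we keep talking
about caps the 'p' counter in the LCS loop, i.e. how much non-matching
work the algorithm will still invest before giving up on a minimal
diff. A self-contained sketch of the idea, in the shape of the
Wu/Manber/Myers O(NP) algorithm (names such as 'max_cost' are made up
for illustration; this is not the actual lcs.c code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
imax(int p, int q)
{
  return p > q ? p : q;
}

/* Furthest-reaching "snake": follow diagonal k while lines match. */
static int
snake(int k, int y, const char **a, int m, const char **b, int n)
{
  int x = y - k;
  while (x < m && y < n && strcmp(a[x], b[y]) == 0)
    {
      x++;
      y++;
    }
  return y;
}

/* Wu/Manber/Myers O(NP) edit distance between the line arrays
 * a (m lines) and b (n lines), requiring m <= n.  Returns the
 * minimal edit distance, or -1 as soon as the deletion count 'p'
 * exceeds max_cost -- the kind of effort cap discussed above. */
static int
capped_edit_distance(const char **a, int m, const char **b, int n,
                     int max_cost)
{
  int delta = n - m;
  int offset = m + 1;
  int size = m + n + 3;
  int *fp = malloc(size * sizeof(int));
  int p, k, result = -1;

  for (k = 0; k < size; k++)
    fp[k] = -1;

  for (p = 0; p <= max_cost; p++)
    {
      for (k = -p; k <= delta - 1; k++)
        fp[k + offset] = snake(k, imax(fp[k - 1 + offset] + 1,
                                       fp[k + 1 + offset]), a, m, b, n);
      for (k = delta + p; k >= delta + 1; k--)
        fp[k + offset] = snake(k, imax(fp[k - 1 + offset] + 1,
                                       fp[k + 1 + offset]), a, m, b, n);
      fp[delta + offset] = snake(delta, imax(fp[delta - 1 + offset] + 1,
                                             fp[delta + 1 + offset]),
                                 a, m, b, n);
      if (fp[delta + offset] >= n)
        {
          result = delta + 2 * p;  /* minimal diff found within budget */
          break;
        }
    }

  free(fp);
  return result;
}

int
main(void)
{
  const char *a[] = { "c", "b", "a", "b", "a", "c" };
  const char *b[] = { "a", "b", "c", "a", "b", "b", "a" };
  printf("edit distance: %d\n", capped_edit_distance(a, 6, b, 7, 64));
  return 0;
}

When this returns -1, the caller has to fall back to producing a
non-minimal diff, which is exactly the output change Johan is
skeptical about.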
My assumption by now is that the algorithm needs to be changed. This
would essentially amount to a complete rewrite of the diff code. There
must be a fundamental difference from how 'git diff' works, for
example, which succeeds on the large files I have tested in a
relatively short amount of time. Even 'got diff' (from my side-project
gameoftrees.org), which does not focus on performance as a first-class
feature, completes a diff of the problematic test case discussed in
this thread in about 3 minutes. It uses a diff implementation written
by Neels, with the Patience algorithm.

I have shelved this topic for now. I have no desire to rewrite
Subversion's diff code. And the original use case is an edge case
where people are versioning large auto-generated XML files for reasons
I don't understand. They worked around it by configuring 'git diff' as
a diff tool for Subversion to use, and have not complained since. I
suspect they could also change their process to work around this
issue, but such changes are not always easy to get pushed through in a
large organization. But, in the end, these are home-made problems
which could be solved by using Subversion as it is intended to be
used.

Cheers,
Stefan
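P.S.: for anyone curious, the core trick of the Patience algorithm is
easy to sketch: only lines which occur exactly once in both files are
considered as match candidates, and a longest increasing subsequence
over their positions selects "anchors" which split the files into
small subproblems that an ordinary LCS can then handle cheaply. A
rough, self-contained illustration (all names made up; this is not
Neels' implementation):

#include <stdio.h>
#include <string.h>

#define MAXL 64  /* sketch only: small fixed bounds */

static int
count(const char *s, const char **arr, int n)
{
  int i, c = 0;
  for (i = 0; i < n; i++)
    if (strcmp(arr[i], s) == 0)
      c++;
  return c;
}

static int
find(const char *s, const char **arr, int n)
{
  int i;
  for (i = 0; i < n; i++)
    if (strcmp(arr[i], s) == 0)
      return i;
  return -1;
}

int
main(void)
{
  /* "x" repeats on both sides, so it can never be an anchor;
   * "gamma" moved, so the LIS step drops it. */
  const char *a[] = { "alpha", "x", "beta", "gamma", "x", "delta" };
  const char *b[] = { "gamma", "x", "alpha", "beta", "x", "delta" };
  int m = 6, n = 6;
  int pa[MAXL], pb[MAXL], np = 0;
  int tails[MAXL], links[MAXL], order[MAXL];
  int i, piles = 0;

  /* Match candidates: lines occurring exactly once in both files.
   * (Quadratic scan for brevity; real code would hash the lines.) */
  for (i = 0; i < m; i++)
    if (count(a[i], a, m) == 1 && count(a[i], b, n) == 1)
      {
        pa[np] = i;
        pb[np] = find(a[i], b, n);
        np++;
      }

  /* Longest increasing subsequence over the b-side positions, using
   * patience piles with binary search (O(np log np)). */
  for (i = 0; i < np; i++)
    {
      int lo = 0, hi = piles;
      while (lo < hi)
        {
          int mid = (lo + hi) / 2;
          if (pb[tails[mid]] < pb[i])
            lo = mid + 1;
          else
            hi = mid;
        }
      links[i] = lo > 0 ? tails[lo - 1] : -1;
      tails[lo] = i;
      if (lo == piles)
        piles++;
    }

  /* Walk back from the top of the last pile to list the anchors. */
  if (piles > 0)
    {
      int k = piles, idx = tails[piles - 1];
      while (idx != -1)
        {
          order[--k] = idx;
          idx = links[idx];
        }
      for (i = 0; i < piles; i++)
        printf("anchor: a[%d] == b[%d]: \"%s\"\n",
               pa[order[i]], pb[order[i]], a[pa[order[i]]]);
    }
  return 0;
}

Because the anchors are restricted to unique lines, big blocks of
shuffled-around sections (like the XML grouping case above) tend to
fall out naturally, without the effort blow-up of a full minimal LCS.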