On Sun, Jun 6, 2021 at 4:57 PM Stefan Sperling <s...@elego.de> wrote: > > On Sat, Jun 05, 2021 at 08:56:24PM +0200, Johan Corveleyn wrote: > > Hmm, I tried this patch with my "annotate large XML file with deep > > history test", but the result isn't the same as with 1.14. I'd have to > > investigate a bit more to find out where it took another turn, and > > what that diff looks like (and whether or not the "other diff" is a > > problem for me). > > It would be good to know how many iterations your file needs to produce > a diff without a limit. You can use something like SVN_DBG(("p=%d\n", p)) > to print this number.
Okay, I focused on the first revision causing the annotate to differ, and with some debug logging: - p went up to 139 - length[0]=1942, length[1]=1817 Now, 1942 lines on the left and 1817 on the right doesn't seem all that many (that's what remains after prefix-suffix stripping). I see no reason 'svn diff' shouldn't be able to calculate a minimal diff for those sizes in a reasonable amount of time. So if p can go up to such a "cost" for such "reasonable" lenghts, I imagine we should put the actual limit much higher. I suppose we can just as well set "min_cost" to 1024 or even higher. 64 is way too low. BTW, the actual revision (and proper diff) here was not really unreadable. It had 86 hunks. The log message was something like "refactor contact parameters", and the change was removing 4 lines and replacing them with 1 line, throughout an entire section. If the lcs-limit gets hit, the entire "remaining part" (which normally would have been some more small hunks) becomes one giant remove+add of 1000 lines or so (i.e. quite useless for a diff that a human might want to review). To give you some more numbers about this large XML file: - it has 17701 revisions - it has almost 140,000 lines in the HEAD revision - it's a little less than 4 MB Annotating the full history (i.e. performing 17700 diffs) takes around 10 minutes, with or without your lcs-limit patch. So it's not like any of those 17700 diffs are taking a long time. I just tried running it with min_cost set to 160, and I still get differences. Next I'll try with 1024. Finally, a couple of small remarks related to the patch itself: [[[ + int max_cost; + /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */ + const int min_cost = 64; ]]] I get confused by the naming of the variables here (unless we use min_max_cost or something). Perhaps something like 'cost_limit' and 'min_cost_limit' or something like that? [[[ + max_cost = shift_sqrt(length[0] + length[1] / 2); ]]] Is that correct WRT operator precedence? It looks like you want to take the sqrt of the average length, which would be (length[0] + length[1]) / 2. Or maybe I misunderstand. I'm not really sure why this is used as an estimate for the "acceptable cost" / max_cost. But then again, it's been a while since I read the details of Myers algorithm. -- Johan