On Sun, Jun 6, 2021 at 4:57 PM Stefan Sperling <s...@elego.de> wrote:
>
> On Sat, Jun 05, 2021 at 08:56:24PM +0200, Johan Corveleyn wrote:
> > Hmm, I tried this patch with my "annotate large XML file with deep
> > history test", but the result isn't the same as with 1.14. I'd have to
> > investigate a bit more to find out where it took another turn, and
> > what that diff looks like (and whether or not the "other diff" is a
> > problem for me).
>
> It would be good to know how many iterations your file needs to produce
> a diff without a limit. You can use something like SVN_DBG(("p=%d\n", p))
> to print this number.

Okay, I focused on the first revision causing the annotate to differ,
and with some debug logging:
- p went up to 139
- length[0]=1942, length[1]=1817

Now, 1942 lines on the left and 1817 on the right doesn't seem all
that many (that's what remains after prefix-suffix stripping). I see
no reason 'svn diff' shouldn't be able to calculate a minimal diff for
those sizes in a reasonable amount of time. So if p can go up to such
a "cost" for such "reasonable" lenghts, I imagine we should put the
actual limit much higher. I suppose we can just as well set "min_cost"
to 1024 or even higher. 64 is way too low.

BTW, the actual revision (and proper diff) here was not really
unreadable. It had 86 hunks. The log message was something like
"refactor contact parameters", and the change was removing 4 lines and
replacing them with 1 line, throughout an entire section. If the
lcs-limit gets hit, the entire "remaining part" (which normally would
have been some more small hunks) becomes one giant remove+add of 1000
lines or so (i.e. quite useless for a diff that a human might want to
review).

To give you some more numbers about this large XML file:
- it has 17701 revisions
- it has almost 140,000 lines in the HEAD revision
- it's a little less than 4 MB

Annotating the full history (i.e. performing 17700 diffs) takes around
10 minutes, with or without your lcs-limit patch. So it's not like any
of those 17700 diffs are taking a long time. I just tried running it
with min_cost set to 160, and I still get differences. Next I'll try
with 1024.

Finally, a couple of small remarks related to the patch itself:
[[[
+  int max_cost;
+  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
+  const int min_cost = 64;
]]]

I get confused by the naming of the variables here (unless we use
min_max_cost or something). Perhaps something like 'cost_limit' and
'min_cost_limit' or something like that?

[[[
+  max_cost = shift_sqrt(length[0] + length[1] / 2);
]]]

Is that correct WRT operator precedence? It looks like you want to
take the sqrt of the average length, which would be (length[0] +
length[1]) / 2. Or maybe I misunderstand. I'm not really sure why this
is used as an estimate for the "acceptable cost" / max_cost. But then
again, it's been a while since I read the details of Myers algorithm.

-- 
Johan

Reply via email to