On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote:
> Okay, I focused on the first revision causing the annotate to differ,
> and with some debug logging:
> - p went up to 139
> - length[0]=1942, length[1]=1817
> 
> Now, 1942 lines on the left and 1817 on the right doesn't seem all
> that many (that's what remains after prefix-suffix stripping). I see
> no reason 'svn diff' shouldn't be able to calculate a minimal diff for
> those sizes in a reasonable amount of time. So if p can go up to such
> a "cost" for such "reasonable" lenghts, I imagine we should put the
> actual limit much higher. I suppose we can just as well set "min_cost"
> to 1024 or even higher. 64 is way too low.

Thanks!
I have set the value back to at least 1024 with this new version of the patch.
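
As a quick sanity check (purely illustrative: a standalone sketch that reuses
the shift_sqrt() from the patch below together with the numbers from your
annotate example, with plain long long standing in for apr_off_t):

[[[
#include <stdio.h>

/* Same power-of-two square root approximation as shift_sqrt() in the
 * patch below; long long instead of apr_off_t so this compiles on its own. */
static int
shift_sqrt(long long val)
{
  int i;

  for (i = 1; val > 0; val >>= 2)
    i <<= 1;

  return i;
}

int main(void)
{
  long long avg = (1942 + 1817) / 2;   /* average stripped length: 1879 */
  int limit = shift_sqrt(avg);         /* 64 */

  /* A floor of 64 would have stopped short of the p=139 you observed;
   * clamping to at least 1024 keeps a case like this on the optimal path. */
  if (limit < 1024)
    limit = 1024;

  printf("cost limit = %d\n", limit);  /* 1024 */
  return 0;
}
]]]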

> BTW, the actual revision (and proper diff) here was not really
> unreadable. It had 86 hunks. The log message was something like
> "refactor contact parameters", and the change was removing 4 lines and
> replacing them with 1 line, throughout an entire section. If the
> lcs-limit gets hit, the entire "remaining part" (which normally would
> have been some more small hunks) becomes one giant remove+add of 1000
> lines or so (i.e. quite useless for a diff that a human might want to
> review).

Indeed, we want to avoid such changes if possible, and only punish
the cases which have problematic run-time behaviour.

> Finally, a couple of small remarks related to the patch itself:
> [[[
> +  int max_cost;
> +  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
> +  const int min_cost = 64;
> ]]]
> 
> I get confused by the naming of the variables here (unless we use
> min_max_cost or something). Perhaps something like 'cost_limit' and
> 'min_cost_limit' or something like that?

Sure, done.

> [[[
> +  max_cost = shift_sqrt(length[0] + length[1] / 2);
> ]]]
> 
> Is that correct WRT operator precedence? It looks like you want to
> take the sqrt of the average length, which would be (length[0] +
> length[1]) / 2. Or maybe I misunderstand. I'm not really sure why this
> is used as an estimate for the "acceptable cost" / max_cost. But then
> again, it's been a while since I read the details of Myers algorithm.

Good catch. In my test case it doesn't make a difference. The limit ends
up being 2048 in either case. But you're right, the code was incorrect.
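
Just to make the precedence difference concrete (again only reusing the
lengths from the annotate example above, so illustrative only):

[[[
#include <assert.h>

int main(void)
{
  long long length[2] = { 1942, 1817 };  /* lengths from the example above */

  long long wrong = length[0] + length[1] / 2;    /* 1942 + 908 == 2850 */
  long long right = (length[0] + length[1]) / 2;  /* 3759 / 2   == 1879 */

  /* Both values still fall into the same power-of-four bucket, so
   * shift_sqrt() returns 64 either way here -- which is presumably why
   * the difference was invisible in testing. */
  assert(wrong == 2850 && right == 1879);
  return 0;
}
]]]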

[[[
* subversion/libsvn_diff/lcs.c
  (shift_sqrt): New helper function. Borrowed from Game of Trees' diff.
  (svn_diff__lcs): Limit the effort spent on finding an optimal diff to
   avoid spinning on the CPU for a long time (30 minutes or more) for
   some input files. Large XML documents were known to trigger this.
]]]

Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c        (revision 1890390)
+++ subversion/libsvn_diff/lcs.c        (working copy)
@@ -227,7 +227,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
   return new_lcs;
 }
 
+/* Integer square root approximation */
+static int
+shift_sqrt(apr_off_t val)
+{
+  int i;
 
+  for (i = 1; val > 0; val >>= 2)
+    i <<= 1;
+
+  return i;
+}
+
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
@@ -245,7 +256,10 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   svn_diff__snake_t *fp;
   apr_off_t d;
   apr_off_t k;
-  apr_off_t p = 0;
+  apr_off_t p = 0; /* # moves away from k=0 */
+  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
+  const int min_cost_limit = 1024;
+  int max_cost_limit;
   svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
 
   svn_diff__position_t sentinel_position[2];
@@ -337,6 +351,24 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   fp[d - 1].position[0] = sentinel_position[0].next;
   fp[d - 1].position[1] = &sentinel_position[1];
 
+  /* Limit the effort spent looking for a diff snake. If files have
+   * very few lines in common, the effort spent to find nice diff
+   * snakes is just not worth it, the diff result will still be
+   * essentially minus everything on the left, plus everything on
+   * the right, with a few useless matches here and there.
+   *
+   * Very large files which have a lot of common lines interleaved with
+   * changed lines (such as humongous auto-generated XML documents) could
+   * easily keep us busy here for a very long time, in the order of hours.
+   * In such cases we abort the algorithm before it is finished and use
+   * the most recently computed intermediate result. The diff will not be
+   * pretty but a quick suboptimal result is better than a perfect result
+   * which takes hours to compute.
+   */
+  max_cost_limit = shift_sqrt((length[0] + length[1]) / 2);
+  if (max_cost_limit < min_cost_limit)
+    max_cost_limit = min_cost_limit;
+
   p = 0;
   do
     {
@@ -353,7 +385,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
 
       p++;
     }
-  while (fp[0].position[1] != &sentinel_position[1]);
+  while (fp[0].position[1] != &sentinel_position[1] && p < max_cost_limit);
 
   if (suffix_lines)
     lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,
