On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote: > Okay, I focused on the first revision causing the annotate to differ, > and with some debug logging: > - p went up to 139 > - length[0]=1942, length[1]=1817 > > Now, 1942 lines on the left and 1817 on the right doesn't seem all > that many (that's what remains after prefix-suffix stripping). I see > no reason 'svn diff' shouldn't be able to calculate a minimal diff for > those sizes in a reasonable amount of time. So if p can go up to such > a "cost" for such "reasonable" lenghts, I imagine we should put the > actual limit much higher. I suppose we can just as well set "min_cost" > to 1024 or even higher. 64 is way too low.
Thanks! I have set the value back to at least 1024 with this new version of patch. > BTW, the actual revision (and proper diff) here was not really > unreadable. It had 86 hunks. The log message was something like > "refactor contact parameters", and the change was removing 4 lines and > replacing them with 1 line, throughout an entire section. If the > lcs-limit gets hit, the entire "remaining part" (which normally would > have been some more small hunks) becomes one giant remove+add of 1000 > lines or so (i.e. quite useless for a diff that a human might want to > review). Indeed, we want to avoid such changes if possible, and only punish the cases which have problematic run-time behaviour. > Finally, a couple of small remarks related to the patch itself: > [[[ > + int max_cost; > + /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. > */ > + const int min_cost = 64; > ]]] > > I get confused by the naming of the variables here (unless we use > min_max_cost or something). Perhaps something like 'cost_limit' and > 'min_cost_limit' or something like that? Sure, done. > [[[ > + max_cost = shift_sqrt(length[0] + length[1] / 2); > ]]] > > Is that correct WRT operator precedence? It looks like you want to > take the sqrt of the average length, which would be (length[0] + > length[1]) / 2. Or maybe I misunderstand. I'm not really sure why this > is used as an estimate for the "acceptable cost" / max_cost. But then > again, it's been a while since I read the details of Myers algorithm. Good catch. In my test case it doesn't make a difference. The limit ends up begin 2048 in either case. But you're right, the code was incorrect. [[[ * subversion/libsvn_diff/lcs.c (shift_sqrt): New helper function. Borrowed from Game of Trees' diff. (svn_diff__lcs): Limit the effort spent on finding an optimal diff to avoid spinning on the CPU for a long time (30 minutes or more) for some input files. Large XML documents were known to trigger this. ]] Index: subversion/libsvn_diff/lcs.c =================================================================== --- subversion/libsvn_diff/lcs.c (revision 1890390) +++ subversion/libsvn_diff/lcs.c (working copy) @@ -227,7 +227,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines, return new_lcs; } +/* Integer square root approximation */ +static int +shift_sqrt(apr_off_t val) +{ + int i; + for (i = 1; val > 0; val >>= 2) + i <<= 1; + + return i; +} + svn_diff__lcs_t * svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */ svn_diff__position_t *position_list2, /* pointer to tail (ring) */ @@ -245,7 +256,10 @@ svn_diff__lcs(svn_diff__position_t *position_list1 svn_diff__snake_t *fp; apr_off_t d; apr_off_t k; - apr_off_t p = 0; + apr_off_t p = 0; /* # moves away from k=0 */ + /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */ + const int min_cost_limit = 1024; + int max_cost_limit; svn_diff__lcs_t *lcs, *lcs_freelist = NULL; svn_diff__position_t sentinel_position[2]; @@ -337,6 +351,24 @@ svn_diff__lcs(svn_diff__position_t *position_list1 fp[d - 1].position[0] = sentinel_position[0].next; fp[d - 1].position[1] = &sentinel_position[1]; + /* Limit the effort spent looking for a diff snake. If files have + * very few lines in common, the effort spent to find nice diff + * snakes is just not worth it, the diff result will still be + * essentially minus everything on the left, plus everything on + * the right, with a few useless matches here and there. + * + * Very large files which have a lot of common lines interleaved with + * changed lines (such as humongous auto-generated XML documents) could + * easily keep us busy here for a very long time, in the order of hours. + * In such cases we abort the algorithm before it is finished and use + * the most recently computed intermediate result. The diff will not be + * pretty but a quick suboptimal result is better than a perfect result + * which takes hours to compute. + */ + max_cost_limit = shift_sqrt((length[0] + length[1]) / 2); + if (max_cost_limit < min_cost_limit) + max_cost_limit = min_cost_limit; + p = 0; do { @@ -353,7 +385,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1 p++; } - while (fp[0].position[1] != &sentinel_position[1]); + while (fp[0].position[1] != &sentinel_position[1] && p < max_cost_limit); if (suffix_lines) lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,