From: Jun Wu <[email protected]>
This patch was originally written for Mercurial at [1]. It adds a threshold
on how long we keep searching for an optimal indent-heuristic slide; the
limit is chosen high enough that only pathological edge cases hit it, like:
```
#!python
open('a', 'w').write(" \n" * 1000000)
open('b', 'w').write(" \n" * 1000001)
```
For the above case the indent heuristic does O(N * 20) work (N = 1000000
shift-able lines; 20 is MAX_BLANKS), which makes diff much slower.
Before this patch (system git 2.14.2):
```
git diff --no-indent-heuristic a b  0.21s user 0.03s system 100% cpu 0.239 total
git diff --indent-heuristic a b  0.77s user 0.02s system 99% cpu 0.785 total
```
After this patch (git 2fc74f41, with xdiffi.c patched):
```
git diff --indent-heuristic a b  0.16s user 0.01s system 90% cpu 0.188 total
git diff --no-indent-heuristic a b  0.18s user 0.01s system 99% cpu 0.192 total
```
Now turning on indent-heuristic has no visible impact on performance.
Differential Revision: https://phab.mercurial-scm.org/D2601
[1] https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
Signed-off-by: Stefan Beller <[email protected]>
---
Jun, Junio,
since this changes the authorship, we'd want a sign-off from the original
author before applying it. In the previous attempt I was merely taking the
code from Mercurial; their copy of xdiff is also LGPLv2, so we are free to
use it.
Thanks,
Stefan
xdiff/xdiffi.c | 38 +++++++++++++++++++++++++++++++++++---
1 file changed, 35 insertions(+), 3 deletions(-)
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 0de1ef463bf..c74ec77da58 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -807,6 +807,14 @@ static void xdl_bug(const char *msg)
exit(1);
}
+/*
+ * For indentation heuristic, skip searching for better slide position after
+ * checking MAX_BORING lines without finding an improvement. This defends the
+ * indentation heuristic logic against pathological cases. The value is not
+ * picked scientifically but should be good enough.
+ */
+#define MAX_BORING 100
+
/*
* Move back and forward change groups for a consistent and pretty diff output.
* This also helps in finding joinable change groups and reducing the diff
@@ -903,19 +911,43 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo,
long flags) {
long shift, best_shift = -1;
struct split_score best_score;
- for (shift = earliest_end; shift <= g.end; shift++) {
+ /*
+ * This is O(N * MAX_BLANKS) (N = shift-able lines).
+ * Even with MAX_BLANKS bounded to a small value, a
+ * large N could still make this loop take several
+ * times longer than the main diff algorithm. The
+ * "boring" value is to help cut down N to something
+ * like (MAX_BORING + groupsize).
+ *
+ * Scan from bottom to top. So we can exit the loop
+ * without compromising the assumption "for a same best
+ * score, pick the bottommost shift".
+ */
+ int boring = 0;
+ for (shift = g.end; shift >= earliest_end; shift--) {
struct split_measurement m;
struct split_score score = {0, 0};
+ int cmp;
measure_split(xdf, shift, &m);
score_add_split(&m, &score);
measure_split(xdf, shift - groupsize, &m);
score_add_split(&m, &score);
- if (best_shift == -1 ||
- score_cmp(&score, &best_score) <= 0) {
+
+ if (best_shift == -1) {
+ cmp = -1;
+ } else {
+ cmp = score_cmp(&score, &best_score);
+ }
+ if (cmp < 0) {
+ boring = 0;
best_score.effective_indent =
score.effective_indent;
best_score.penalty = score.penalty;
best_shift = shift;
+ } else {
+ boring += 1;
+ if (boring >= MAX_BORING)
+ break;
}
}
--
2.18.0.399.gad0ab374a1-goog