[[[
Speed up the LCS algorithm for well-separated sections of change.

By exploiting unique matches between the two files, it is possible to
"restart" the LCS algorithm at times and reset the build-up of quadratic
computational complexity.
* subversion/libsvn_diff/lcs.c
  (svn_diff__snake_t): Add uniquematchcount member.
  (svn_diff__snake): Increment uniquematchcount each time a matching
   token occurs only once in each file.
  (clear_fp): New function. Removes nonviable states.
  (svn_diff__lcs): Restart the LCS algorithm from p=0 if the number of
   unique matches exceeds the number of mismatches, using the first
   state where that occurs as the new start state.
]]]

This patch is a partial implementation of a possible enhancement to the LCS algorithm. It builds on the patch in the thread "[PATCH] Speed-up of libsvn_diff using token counts". I have only included the lcs.c file in this patch; the other files from that other patch must already be applied.

Theory behind the enhancement: if the number of unique matches (tokens that occur only once in each file) found so far exceeds the number of mismatches found so far, and this is the FIRST time that happens, then that state is guaranteed to be part of the true LCS, since eliminating any of the mismatches would necessarily introduce at least as many new mismatches among the current unique matches. We can therefore "start over" from the current state and stop the build-up of the computational complexity represented by the 'p' variable. By also resetting the number of unique matches found to 0, we can repeat this process many times. (A small standalone sketch below illustrates the criterion.)

This implementation is partial in the sense that it does not reach the full potential of the enhancement. It still scans at least 'd' states for each value of 'p', whereas an optimal implementation would interrupt the scan as soon as there is a mismatch for what would have been a unique match (since such a mismatch always makes the potential LCS worse, regardless of whether the change is an insertion or a deletion). However, a full implementation would be significantly more complicated, as far as I can see.

The current implementation can reduce the average computational complexity from O((p_max + d)*p_max) to O(sum((p_n + d)*p_n)), while an optimal implementation would be O(sum((p_n + d_n)*p_n)). Here p_max is the total number of mismatches in the shorter file, while p_n and d_n are the lesser number of mismatches and the length difference between the files within each area of difference, where an area of difference is separated from other such areas by a large number of (not necessarily contiguous) unique matches.

The current version is only really useful if there is a similar number of mismatches in each file, i.e. if there is a similar number of insertions and deletions (not counting tokens that are unique to one of the files).

Johan: I initially thought it would be easiest to give unique matches higher weight in the diff algorithm. This would have meant a change in the definition of the optimal diff: rather than asking simply for the largest number of matching tokens, we would prefer diffs with slightly fewer total matches but more unique matches. It would not be a heuristic, since the result would still be a guaranteed optimal diff under the new optimality criterion, but it would no longer be an LCS. However, it turned out to be easier to keep the same weights.

The patch will need significant changes to integrate with the idx patch (also a libsvn_diff speed-up). The current version doesn't give much speed improvement for most of my tests so far, but it can give very large improvements for "ideal" cases (many small areas of non-unique changes that are separated by many unique matches, and where the files are of equal length).
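To make the restart criterion concrete, here is a small standalone sketch. It is not part of the patch: it uses plain arrays and a brute-force count_occurrences() helper instead of libsvn_diff's position rings and token-count arrays, and it only handles the simple case where the two sequences are already aligned, so each differing position counts as a single mismatch:

/* A minimal standalone sketch (not part of the patch).  It walks two
 * already-aligned toy token sequences in lock-step and reports each
 * first state where the unique matches seen so far outnumber the
 * mismatches seen so far. */
#include <stdio.h>

#define LEN 8

/* Brute-force stand-in for the token-count arrays that the
 * "token counts" patch provides. */
static int
count_occurrences(const int *seq, int len, int token)
{
  int i, n = 0;

  for (i = 0; i < len; i++)
    if (seq[i] == token)
      n++;
  return n;
}

int
main(void)
{
  /* Two toy "files": positions 2 and 3 differ; every other token
   * matches and occurs exactly once in each sequence. */
  int a[LEN] = { 10, 20, 30, 40, 50, 60, 70, 80 };
  int b[LEN] = { 10, 20, 31, 41, 50, 60, 70, 80 };
  int i, unique_matches = 0, mismatches = 0;

  for (i = 0; i < LEN; i++)
    {
      if (a[i] == b[i]
          && count_occurrences(a, LEN, a[i]) == 1
          && count_occurrences(b, LEN, a[i]) == 1)
        unique_matches++;
      else if (a[i] != b[i])
        mismatches++;

      if (unique_matches > mismatches)
        {
          /* This state must lie on the true LCS: dropping any of the
           * mismatches seen so far would create at least as many new
           * mismatches among the unique matches seen so far.  The
           * real algorithm restarts here with p = 0. */
          printf("restart after position %d (%d unique, %d mismatched)\n",
                 i, unique_matches, mismatches);
          unique_matches = 0;
          mismatches = 0;
        }
    }
  /* Prints restarts after positions 0, 1, 6 and 7; the one after
   * position 6 shows the build-up being reset across the two
   * mismatches. */
  return 0;
}

In the patch itself the same test appears as the fp[k].uniquematchcount comparisons in svn_diff__lcs. My reading of the bookkeeping there: a path ending on diagonal k with v deletions has v + k insertions, i.e. 2v + k mismatches, so with d >= 0 the forward scan (v = p) tests uniquematchcount > k + 2*p and the backward scan (v = p + d - k) tests uniquematchcount > 2*(p + d) - k; the d < 0 and d <= 0 conditions are the mirrored cases. Once a test succeeds, clear_fp discards every other fp state and the scan restarts from the winning state.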
I'm not sure whether the current version (integrated with idx) is worthwhile. I'll look into making a full implementation of the enhancement.

Morten
Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c  (revision 1128318)
+++ subversion/libsvn_diff/lcs.c  (working copy)
@@ -45,12 +45,14 @@
   apr_off_t y;
   svn_diff__lcs_t *lcs;
   svn_diff__position_t *position[2];
+  apr_int32_t uniquematchcount;
 };
 
 static APR_INLINE void
 svn_diff__snake(apr_off_t k,
                 svn_diff__snake_t *fp,
                 int idx,
+                apr_uint32_t *token_counts[2],
                 svn_diff__lcs_t **freelist,
                 apr_pool_t *pool)
 {
@@ -58,6 +60,7 @@
   svn_diff__position_t *position[2];
   svn_diff__lcs_t *lcs;
   svn_diff__lcs_t *previous_lcs;
+  apr_int32_t uniquematchcount;
 
   /* The previous entry at fp[k] is going to be replaced.  See if we
    * can mark that lcs node for reuse, because the sequence up to this
@@ -80,6 +83,7 @@
     {
       start_position[0] = fp[k - 1].position[0];
       start_position[1] = fp[k - 1].position[1]->next;
+      uniquematchcount = fp[k - 1].uniquematchcount;
       previous_lcs = fp[k - 1].lcs;
     }
 
@@ -87,11 +91,17 @@
     {
       start_position[0] = fp[k + 1].position[0]->next;
       start_position[1] = fp[k + 1].position[1];
+      uniquematchcount = fp[k + 1].uniquematchcount;
       previous_lcs = fp[k + 1].lcs;
     }
 
+  if (previous_lcs)
+    {
+      previous_lcs->refcount++;
+    }
+
   /* ### Optimization, skip all positions that don't have matchpoints
    * ### anyway. Beware of the sentinel, don't skip it!
    */
@@ -99,8 +109,12 @@
   position[0] = start_position[0];
   position[1] = start_position[1];
 
+  while (1)
+    {
   while (position[0]->node == position[1]->node)
     {
+      if (token_counts[0][position[0]->node] == 1 && token_counts[1][position[0]->node] == 1)
+        uniquematchcount++;
       position[0] = position[0]->next;
       position[1] = position[1]->next;
     }
@@ -122,22 +136,24 @@
       lcs->length = position[1]->offset - start_position[1]->offset;
       lcs->next = previous_lcs;
       lcs->refcount = 1;
-      fp[k].lcs = lcs;
+      previous_lcs = lcs;
+      start_position[0] = position[0];
+      start_position[1] = position[1];
     }
-  else
-    {
-      fp[k].lcs = previous_lcs;
+  if (position[0]->node >= 0 && token_counts[1][position[0]->node] == 0)
+    start_position[0] = position[0] = position[0]->next;
+  else if (position[1]->node >= 0 && token_counts[0][position[1]->node] == 0)
+    start_position[1] = position[1] = position[1]->next;
+  else
+    break;
     }
 
-  if (previous_lcs)
-    {
-      previous_lcs->refcount++;
-    }
-
+  fp[k].lcs = previous_lcs;
   fp[k].position[0] = position[0];
   fp[k].position[1] = position[1];
 
   fp[k].y = position[1]->offset;
+  fp[k].uniquematchcount = uniquematchcount;
 }
 
 
@@ -185,18 +201,64 @@
 }
 
 
+/* Remove all nonviable fp states: K is the state known to be part of
+ * the true LCS; move it to fp[0] and recycle the other lcs chains. */
+static void
+clear_fp(svn_diff__snake_t *fp,
+         apr_off_t k,
+         apr_off_t min_index,
+         apr_off_t max_index,
+         svn_diff__lcs_t **lcs_freelist)
+{
+  svn_diff__snake_t best;
+  svn_diff__lcs_t *lcs, *prev_lcs;
+
+  best = fp[k];
+  best.lcs->refcount++;
+  for (k = min_index; k <= max_index; k++)
+    {
+      lcs = fp[k].lcs;
+      fp[k].y = 0;
+      fp[k].lcs = NULL;
+      fp[k].uniquematchcount = 0;
+      while (lcs)
+        {
+          lcs->refcount--;
+          if (lcs->refcount)
+            break;
+
+          prev_lcs = lcs->next;
+          lcs->next = *lcs_freelist;
+          *lcs_freelist = lcs;
+          lcs = prev_lcs;
+        }
+    }
+  fp[0] = best;
+  fp[0].uniquematchcount = 0;
+}
+
+
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
+              apr_uint32_t *token_counts_list1, /* array of counts */
+              apr_uint32_t *token_counts_list2, /* array of counts */
+              apr_uint32_t num_tokens,
               apr_off_t prefix_lines,
               apr_off_t suffix_lines,
              apr_pool_t *pool)
 {
   int idx;
   apr_off_t length[2];
+  apr_uint32_t *token_counts[2];
+  apr_uint32_t unique_count[2];
+  apr_uint32_t token_index;
   svn_diff__snake_t *fp;
   apr_off_t d;
   apr_off_t k;
+  apr_off_t min_index;
+  apr_off_t max_index;
+  apr_off_t oldk;
   apr_off_t p = 0;
   svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
 
@@ -231,10 +293,19 @@
       return lcs;
     }
 
+  unique_count[1] = unique_count[0] = 0;
+  for (token_index = 0; token_index < num_tokens; token_index++)
+    {
+      if (token_counts_list1[token_index] == 0)
+        unique_count[1] += token_counts_list2[token_index];
+      if (token_counts_list2[token_index] == 0)
+        unique_count[0] += token_counts_list1[token_index];
+    }
+
   /* Calculate length of both sequences to be compared */
   length[0] = position_list1->offset - position_list1->next->offset + 1;
   length[1] = position_list2->offset - position_list2->next->offset + 1;
-  idx = length[0] > length[1] ? 1 : 0;
+  idx = length[0] - unique_count[0] > length[1] - unique_count[1] ? 1 : 0;
 
   /* strikerXXX: here we allocate the furthest point array, which is
    * strikerXXX: sized M + N + 3 (!)
@@ -246,18 +317,20 @@
   sentinel_position[idx].next = position_list1->next;
   position_list1->next = &sentinel_position[idx];
   sentinel_position[idx].offset = position_list1->offset + 1;
+  token_counts[idx] = token_counts_list1;
 
   sentinel_position[abs(1 - idx)].next = position_list2->next;
   position_list2->next = &sentinel_position[abs(1 - idx)];
   sentinel_position[abs(1 - idx)].offset = position_list2->offset + 1;
+  token_counts[abs(1 - idx)] = token_counts_list2;
 
-  /* These are never dereferenced, only compared by value, so we
-   * can safely fake these up and the void* cast is OK.
+  /* Negative indices will not be used elsewhere
    */
-  sentinel_position[0].node = (void*)&sentinel_position[0];
-  sentinel_position[1].node = (void*)&sentinel_position[1];
+  sentinel_position[0].node = -1;
+  sentinel_position[1].node = -2;
 
-  d = length[abs(1 - idx)] - length[idx];
+  d = length[abs(1 - idx)] - unique_count[abs(1 - idx)]
+      - (length[idx] - unique_count[idx]);
 
   /* k = -1 will be the first to be used to get previous
    * position information from, make sure it holds sane
@@ -269,17 +342,33 @@
   p = 0;
   do
     {
+      min_index = -p + (d < 0 ? d : 0);
+      max_index = p + (d > 0 ? d : 0);
       /* Forward */
-      for (k = -p; k < d; k++)
+      for (k = min_index; k < d; k++)
         {
-          svn_diff__snake(k, fp, idx, &lcs_freelist, pool);
+          svn_diff__snake(k, fp, idx, token_counts, &lcs_freelist, pool);
+          if ((d >= 0 && fp[k].uniquematchcount > k + 2 * p) || (d < 0 && fp[k].uniquematchcount > k + 2 * (p - d)))
+            {
+              d -= k;
+              clear_fp(fp, k, min_index, max_index, &lcs_freelist);
+              p = 0;
+              k = 0;
+              max_index = (d > 0 ? d : 0);
+            }
         }
 
-      for (k = d + p; k >= d; k--)
+      for (k = max_index; k >= d; k--)
         {
-          svn_diff__snake(k, fp, idx, &lcs_freelist, pool);
+          svn_diff__snake(k, fp, idx, token_counts, &lcs_freelist, pool);
+          if ((d <= 0 && fp[k].uniquematchcount > -k + 2 * p) || (d > 0 && fp[k].uniquematchcount > 2 * (p + d) - k))
+            {
+              d -= k;
+              clear_fp(fp, k, min_index, max_index, &lcs_freelist);
+              p = 0;
+              k = 0;
+            }
         }
-
       p++;
     }
   while (fp[d].position[1] != &sentinel_position[1]);