Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Johan Corveleyn Tue, 19 Oct 2010 15:42:16 -0700

On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <julian.f...@wandisco.com> wrote:
> On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
>> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <julian.f...@wandisco.com> wrote:
>> > On Sat, 2010-10-09, Johan Corveleyn wrote:
>> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.f...@wandisco.com> wrote:
>> >> > So I wrote a patch - attached - that refactors this into an array of 4
>> >> > sub-structures, and simplifies all the code that uses them.
>> > [...]
>> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
>> >> So please go ahead and commit the refactoring.
>> >
>> > OK, committed in r1021282.
>>
>> Thanks, looks much more manageable now.
>
> I'd like to see a simplified version of your last patch, taking
> advantage of that, before you go exploring other options.


Ok, here's a new version of the patch, taking advantage of your
file_info refactoring. This vastly simplifies the code, so that it
might actually be understandable now :-).

Other things I've done in this version:
1) Generalized everything to handle an array of datasources/files,
instead of just two. This makes it slightly more complex here and
there (using for loops everywhere), but I think it's ok, and it's also
more consistent/generic. If anyone has better ideas to do those for
loops, suggestions welcome.

This makes the algorithm usable by diff3 as well (and by diff4, if
needed). I have not yet enabled it for diff3, because I haven't yet
understood how diff3 generates its diff output, which needs to take
prefix_lines into account. I tried some quick hacks, but lots of
tests were failing, so I'll have to look into it further; that's for
a follow-up patch. Once I can enable it for diff3 (and diff4), I can
remove datasource_open (with one datasource).

2) Removed get_prefix_lines from svn_diff_fns_t (and its
implementations in diff_file.c and diff_memory.c). Instead I pass
prefix_lines directly to token.c#svn_diff__get_tokens.

3) If prefix scanning ended in the last chunk, the suffix scanning now
reuses that buffer, which already contains the last chunk. As a special
case, this also avoids reading the file twice if it's smaller than 128
KB.

4) Added doc strings everywhere. Feel free to edit those, I'm new at
documenting things in svn.
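
To illustrate the general idea behind the points above, here is a
minimal sketch on two in-memory buffers. It is not the code from the
patch: the patch works on chunked files, handles \r and \r\n, supports
more than two datasources and keeps 50 suffix lines. The names
identical_prefix_len, identical_suffix_len and at_line_start are made
up for this example, and only "\n" line endings are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the patch): return the byte length of the
 * identical prefix of A and B, rounded down to the last complete line.
 * The number of complete identical lines is stored in *PREFIX_LINES. */
static size_t
identical_prefix_len(const char *a, size_t a_len,
                     const char *b, size_t b_len,
                     size_t *prefix_lines)
{
  size_t min_len = a_len < b_len ? a_len : b_len;
  size_t i = 0, line_start = 0;

  assert(prefix_lines != NULL);
  *prefix_lines = 0;
  while (i < min_len && a[i] == b[i])
    {
      if (a[i] == '\n')
        {
          (*prefix_lines)++;
          line_start = i + 1;   /* first byte after the eol */
        }
      i++;
    }
  return line_start;
}

/* TRUE if a suffix of SUFFIX_LEN bytes starts at a line boundary in BUF. */
static int
at_line_start(const char *buf, size_t len, size_t suffix_len)
{
  return suffix_len == len || buf[len - suffix_len - 1] == '\n';
}

/* Hypothetical helper (not in the patch): return the byte length of the
 * identical suffix of A and B.  The suffix never overlaps the first
 * PREFIX_LEN bytes of either buffer, and it is shrunk until it starts
 * at a line boundary in both buffers. */
static size_t
identical_suffix_len(const char *a, size_t a_len,
                     const char *b, size_t b_len,
                     size_t prefix_len)
{
  size_t min_len = a_len < b_len ? a_len : b_len;
  size_t max, n = 0;

  assert(prefix_len <= min_len);
  max = min_len - prefix_len;
  while (n < max && a[a_len - 1 - n] == b[b_len - 1 - n])
    n++;
  /* Drop the partial line at the start of the suffix, if any. */
  while (n > 0 && !(at_line_start(a, a_len, n) && at_line_start(b, b_len, n)))
    n--;
  return n;
}
```

Only the bytes between the prefix and the suffix would then need to be
tokenized and fed to the LCS algorithm, with line offsets shifted by
the number of prefix lines.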


Still TODO:
- rev svn_diff_fns_t and maybe other stuff I've changed in public API.
- See if implementing the critical parts of increment_pointers and
decrement_pointers in a macro improves performance.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. For
this (and for other reasons), I'd still like to investigate pushing
this optimization into the token parsing/handling layer, to extract
entire tokens etc., even if this means the current patch has to be
thrown away. I'll take this up in a separate thread.
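
As a rough idea of what the macro item above could look like, here is
a sketch that keeps the common case (advancing inside the current
chunk) inline and only calls a function at the chunk boundary. The
struct toy_file, toy_increment_slow, TOY_INCREMENT and toy_count_bytes
are hypothetical stand-ins invented for this example; the slow path
here only flags EOF, whereas the real one would read the next chunk.
Whether this actually improves performance would still have to be
measured.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the patch's struct file_info (illustration only). */
struct toy_file
{
  const char *curp;   /* current position in the chunk */
  const char *endp;   /* one past the last byte of the chunk */
  int slow_calls;     /* how often the slow path was taken */
};

/* Slow path.  In the patch this is where the next chunk would be read;
 * this toy version has a single chunk, so it only signals EOF by
 * setting curp equal to endp. */
static void
toy_increment_slow(struct toy_file *f)
{
  assert(f != NULL);
  f->slow_calls++;
  f->curp = f->endp;
}

/* Hypothetical macro: advance inline while inside the chunk, and call
 * the function only when the last byte of the chunk has been reached. */
#define TOY_INCREMENT(f)               \
  do {                                 \
    if ((f)->curp < (f)->endp - 1)     \
      (f)->curp++;                     \
    else                               \
      toy_increment_slow(f);           \
  } while (0)

/* Walk F from its current position to EOF, returning the byte count. */
static size_t
toy_count_bytes(struct toy_file *f)
{
  size_t n = 0;

  while (f->curp != f->endp)
    {
      n++;
      TOY_INCREMENT(f);
    }
  return n;
}
```

With a five-byte buffer, the loop takes the inline path four times and
the function call once.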

Log message:
[[[
Make svn_diff skip identical prefix and suffix to make diff and blame faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function type datasources_open to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (svn_diff__mem_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added member prefix_lines, and inside the
   struct file_info the members suffix_start_chunk and suffix_offset_in_chunk.
  (increment_pointers, decrement_pointers): New functions.
  (is_one_at_bof, is_one_at_eof): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open multiple datasources and find
   their identical prefix and suffix, so these can be excluded from the rest
   of the diff algorithm, as a performance optimization. From the identical
   suffix, 50 lines are kept to help the diff algorithm find the nicest
   possible diff representation in case of ambiguity.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened, and argument "prefix_lines", the number
   of identical prefix lines, to use as the starting offset for the tokens
   we're getting.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added arguments "datasource_opened" and
   "prefix_lines". Only open the datasource if datasource_opened is FALSE.
   Set the starting offset of the position list to the number of prefix_lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open to open original and
   modified at once and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__get_tokens, svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE and prefix_lines = 0 to
   svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE and prefix_lines = 0 to
   svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs.
]]]
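
The prefix_lines value mentioned throughout the log message is a count
of eol sequences in the identical prefix, where \n, \r\n and \r each
count as one line ending. A minimal sketch of that counting rule on an
in-memory buffer, assuming the same had_cr approach as in the patch
(count_eols is a name made up for this example):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the patch): count the eol sequences in
 * BUF, treating "\n", "\r\n" and "\r" as one line ending each. */
static long
count_eols(const char *buf, size_t len)
{
  long lines = 0;
  int had_cr = 0;   /* was the previous byte a '\r'? */
  size_t i;

  assert(buf != NULL || len == 0);
  for (i = 0; i < len; i++)
    {
      if (buf[i] == '\r')
        {
          lines++;          /* "\r", or the first half of "\r\n" */
          had_cr = 1;
        }
      else
        {
          if (buf[i] == '\n' && !had_cr)
            lines++;        /* a lone "\n" */
          had_cr = 0;       /* a "\n" right after "\r" was already counted */
        }
    }
  return lines;
}
```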

Cheers,
-- 
Johan

Index: subversion/include/svn_diff.h
===================================================================
--- subversion/include/svn_diff.h       (revision 1023075)
+++ subversion/include/svn_diff.h       (working copy)
@@ -112,6 +112,11 @@ typedef struct svn_diff_fns_t
   svn_error_t *(*datasource_open)(void *diff_baton,
                                   svn_diff_datasource_e datasource);
 
+  /** Open the datasources of type @a datasource. */
+  svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines,
+                                   svn_diff_datasource_e datasource[],
+                                   int datasource_len);
+
   /** Close the datasource of type @a datasource. */
   svn_error_t *(*datasource_close)(void *diff_baton,
                                    svn_diff_datasource_e datasource);
Index: subversion/libsvn_diff/diff_memory.c
===================================================================
--- subversion/libsvn_diff/diff_memory.c        (revision 1023075)
+++ subversion/libsvn_diff/diff_memory.c        (working copy)
@@ -95,6 +95,15 @@ datasource_open(void *baton, svn_diff_datasource_e
   return SVN_NO_ERROR;
 }
 
+/* Implements svn_diff_fns_t::datasources_open */
+static svn_error_t *
+datasources_open(void *baton, apr_off_t *prefix_lines,
+                 svn_diff_datasource_e datasource[], 
+                 int datasource_len)
+{
+  /* Do nothing: everything is already there and initialized to 0 */
+  return SVN_NO_ERROR;
+}
 
 /* Implements svn_diff_fns_t::datasource_close */
 static svn_error_t *
@@ -189,6 +198,7 @@ token_discard_all(void *baton)
 static const svn_diff_fns_t svn_diff__mem_vtable =
 {
   datasource_open,
+  datasources_open,
   datasource_close,
   datasource_get_next_token,
   token_compare,
Index: subversion/libsvn_diff/diff_file.c
===================================================================
--- subversion/libsvn_diff/diff_file.c  (revision 1023075)
+++ subversion/libsvn_diff/diff_file.c  (working copy)
@@ -82,8 +82,15 @@ typedef struct svn_diff__file_baton_t
     char *endp;    /* next memory address after the current chunk */
 
     svn_diff__normalize_state_t normalize_state;
+
+    /* Where the identical suffix starts in this datasource */
+    int suffix_start_chunk;
+    apr_off_t suffix_offset_in_chunk;
   } files[4];
 
+  /* Number of prefix lines identical between all datasources */
+  apr_off_t prefix_lines;
+
   /* List of free tokens that may be reused. */
   svn_diff__file_token_t *tokens;
 
@@ -242,7 +249,388 @@ datasource_open(void *baton, svn_diff_datasource_e
                     curp, length, 0, file_baton->pool);
 }
 
+/* For all files in the FILE array, increment the curp pointer.  If a file
+ * points before the beginning of file, let it point at the first byte again.
+ * If the end of the current chunk is reached, read the next chunk in the
+ * buffer and point curp to the start of the chunk.  If EOF is reached, set
+ * curp equal to endp to indicate EOF. */
+static svn_error_t *
+increment_pointers(struct file_info *file[], int file_len, apr_pool_t *pool)
+{
+  int i;
 
+  for (i = 0; i < file_len; i++)
+    if (file[i]->chunk == -1) /* indicates before beginning of file */
+      {
+        file[i]->chunk = 0; /* point to beginning of file again */
+      }
+    else if (file[i]->curp == file[i]->endp - 1)
+      {
+        apr_off_t last_chunk = offset_to_chunk(file[i]->size);
+        if (file[i]->chunk == last_chunk)
+          {
+            file[i]->curp++; /* curp == endp signals end of file */
+          }
+        else
+          {
+            apr_off_t length;
+            file[i]->chunk++;
+            length = file[i]->chunk == last_chunk ? 
+              offset_in_chunk(file[i]->size) : CHUNK_SIZE;
+            SVN_ERR(read_chunk(file[i]->file, file[i]->path, file[i]->buffer,
+                               length, chunk_to_offset(file[i]->chunk),
+                               pool));
+            file[i]->endp = file[i]->buffer + length;
+            file[i]->curp = file[i]->buffer;
+          }
+      }
+    else
+      {
+        file[i]->curp++;
+      }
+
+  return SVN_NO_ERROR;
+}
+
+/* For all files in the FILE array, decrement the curp pointer.  If the
+ * start of a chunk is reached, read the previous chunk in the buffer and
+ * point curp to the last byte of the chunk.  If the beginning of a FILE is
+ * reached, set chunk to -1 to indicate BOF. */
+static svn_error_t *
+decrement_pointers(struct file_info *file[], int file_len, apr_pool_t *pool)
+{
+  int i;
+
+  for (i = 0; i < file_len; i++)
+    if (file[i]->curp == file[i]->buffer)
+      {
+        if (file[i]->chunk == 0)
+          file[i]->chunk--; /* chunk == -1 signals beginning of file */
+        else
+          {
+            file[i]->chunk--;
+            SVN_ERR(read_chunk(file[i]->file, file[i]->path, file[i]->buffer,
+                               CHUNK_SIZE, chunk_to_offset(file[i]->chunk),
+                               pool));
+            file[i]->endp = file[i]->buffer + CHUNK_SIZE;
+            file[i]->curp = file[i]->endp - 1;
+          }
+      }
+    else
+      {
+        file[i]->curp--;
+      }
+
+  return SVN_NO_ERROR;
+}
+
+/* Check whether one of the FILEs has its pointers 'before' the beginning of
+ * the file (this can happen while scanning backwards). This is the case if
+ * one of them has chunk == -1. */
+static svn_boolean_t
+is_one_at_bof(struct file_info *file[], int file_len)
+{
+  int i;
+
+  for (i = 0; i < file_len; i++)
+    if (file[i]->chunk == -1)
+      return TRUE;
+
+  return FALSE;
+}
+
+/* Check whether one of the FILEs has its pointers at EOF (this is the case if
+ * one of them has curp == endp (this can only happen at the last chunk)) */
+static svn_boolean_t
+is_one_at_eof(struct file_info *file[], int file_len)
+{
+  int i;
+
+  for (i = 0; i < file_len; i++)
+    if (file[i]->curp == file[i]->endp)
+      return TRUE;
+
+  return FALSE;
+}
+
+/* Find the prefix which is identical between all elements of the FILE array.
+ * Return the number of prefix lines in PREFIX_LINES.  REACHED_ONE_EOF will be
+ * set to TRUE if one of the FILEs reached its end while scanning prefix,
+ * i.e. at least one file consisted entirely of prefix.  Otherwise, 
+ * REACHED_ONE_EOF is set to FALSE.
+ *
+ * After this function is finished, the buffers, chunks, curp's and endp's 
+ * of the FILEs are set to point at the first byte after the prefix. */
+static svn_error_t *
+find_identical_prefix(svn_boolean_t *reached_one_eof, apr_off_t *prefix_lines,
+                      struct file_info *file[], int file_len,
+                      apr_pool_t *pool)
+{
+  svn_boolean_t had_cr = FALSE;
+  svn_boolean_t is_match, reached_all_eof;
+  int i;
+
+  *prefix_lines = 0;
+  for (i = 1, is_match = TRUE; i < file_len; i++)
+    is_match = is_match && *file[0]->curp == *file[i]->curp;
+  while (is_match)
+    {
+      /* ### TODO: see if we can take advantage of 
+         diff options like ignore_eol_style or ignore_space. */
+      /* check for eol, and count */
+      if (*file[0]->curp == '\r')
+        {
+          (*prefix_lines)++;
+          had_cr = TRUE;
+        }
+      else if (*file[0]->curp == '\n' && !had_cr)
+        {
+          (*prefix_lines)++;
+          had_cr = FALSE;
+        }
+      else 
+        {
+          had_cr = FALSE;
+        }
+
+      SVN_ERR(increment_pointers(file, file_len, pool));
+
+      /* curp == endp indicates EOF (this can only happen with last chunk) */
+      *reached_one_eof = is_one_at_eof(file, file_len);
+      if (*reached_one_eof)
+        break;
+      else
+        for (i = 1, is_match = TRUE; i < file_len; i++)
+          is_match = is_match && *file[0]->curp == *file[i]->curp;
+    }
+
+  /* If all files reached their end (i.e. are fully identical), we're done */
+  for (i = 0, reached_all_eof = TRUE; i < file_len; i++)
+    reached_all_eof = reached_all_eof && file[i]->curp == file[i]->endp;
+  if (reached_all_eof)
+    return SVN_NO_ERROR;
+
+  if (had_cr)
+    {
+      /* Check if we ended in the middle of a \r\n for one file, but \r for 
+         another. If so, back up one byte, so the next loop will back up
+         the entire line. Also decrement *prefix_lines, since we counted one
+         too many for the \r. */
+      svn_boolean_t ended_at_nonmatching_newline = FALSE;
+      for (i = 0; i < file_len; i++)
+        ended_at_nonmatching_newline = ended_at_nonmatching_newline 
+                                       || *file[i]->curp == '\n';
+      if (ended_at_nonmatching_newline)
+        {
+          (*prefix_lines)--;
+          SVN_ERR(decrement_pointers(file, file_len, pool));
+        }
+    }
+
+  /* Back up one byte, so we point at the last identical byte */
+  SVN_ERR(decrement_pointers(file, file_len, pool));
+
+  /* Back up to the last eol sequence (\n, \r\n or \r) */
+  while (!is_one_at_bof(file, file_len) && 
+         *file[0]->curp != '\n' && *file[0]->curp != '\r')
+    SVN_ERR(decrement_pointers(file, file_len, pool));
+
+  /* Slide one byte forward, to point past the eol sequence */
+  SVN_ERR(increment_pointers(file, file_len, pool));
+
+  return SVN_NO_ERROR;
+}
+
+#define SUFFIX_LINES_TO_KEEP 50
+
+/* Find the suffix which is identical between all elements of the FILE array.
+ *
+ * Before this function is called the FILEs' pointers and chunks should be 
+ * positioned right after the identical prefix (which is the case after 
+ * find_identical_prefix), so we can determine where suffix scanning should 
+ * ultimately stop. */
+static svn_error_t *
+find_identical_suffix(struct file_info *file[], int file_len,
+                      apr_pool_t *pool)
+{
+  struct file_info file_for_suffix[4];
+  struct file_info *file_for_suffix_ptr[4];
+  apr_off_t length[4];
+  apr_off_t suffix_min_chunk0;
+  apr_off_t suffix_min_offset0;
+  apr_off_t min_file_size;
+  int suffix_lines_to_keep = SUFFIX_LINES_TO_KEEP;
+  svn_boolean_t is_match, reached_prefix;
+  int i;
+
+  for (i = 0; i < file_len; i++)
+    {
+      memset(&file_for_suffix[i], 0, sizeof(file_for_suffix[i]));
+      file_for_suffix_ptr[i] = &file_for_suffix[i];
+    }
+
+  /* Initialize file_for_suffix[].
+     Read last chunk, position curp at last byte. */
+  for (i = 0; i < file_len; i++)
+    {
+      file_for_suffix[i].path = file[i]->path;
+      file_for_suffix[i].file = file[i]->file;
+      file_for_suffix[i].size = file[i]->size;
+      file_for_suffix[i].chunk =
+        (int) offset_to_chunk(file_for_suffix[i].size); /* last chunk */
+      length[i] = offset_in_chunk(file_for_suffix[i].size);
+      if (file_for_suffix[i].chunk == file[i]->chunk)
+        {
+          /* Prefix ended in last chunk, so we can reuse the prefix buffer */
+          file_for_suffix[i].buffer = file[i]->buffer;
+        }
+      else
+        {
+          /* There is more than 1 chunk,
+             so allocate full chunk size buffer */
+          file_for_suffix[i].buffer = apr_palloc(pool, CHUNK_SIZE);
+          SVN_ERR(read_chunk(file_for_suffix[i].file, file_for_suffix[i].path,
+                             file_for_suffix[i].buffer, length[i],
+                             chunk_to_offset(file_for_suffix[i].chunk),
+                             pool));
+        }
+      file_for_suffix[i].endp = file_for_suffix[i].buffer + length[i];
+      file_for_suffix[i].curp = file_for_suffix[i].endp - 1;
+    }
+
+  /* Get the chunk and pointer offset (for file[0]) at which we should stop
+     scanning backward for the identical suffix, i.e. when we reach prefix. */
+  suffix_min_chunk0 = file[0]->chunk;
+  suffix_min_offset0 = file[0]->curp - file[0]->buffer;
+
+  /* Compensate if other files are smaller than file[0] */
+  for (i = 1, min_file_size = file[0]->size; i < file_len; i++)
+    if (file[i]->size < min_file_size)
+      min_file_size = file[i]->size;
+  if (file[0]->size > min_file_size)
+    {
+      suffix_min_chunk0 += (file[0]->size - min_file_size) / CHUNK_SIZE;
+      suffix_min_offset0 += (file[0]->size - min_file_size) % CHUNK_SIZE;
+    }
+
+  /* Scan backwards until mismatch or until we reach the prefix. */
+  for (i = 1, is_match = TRUE; i < file_len; i++)
+    is_match =
+      is_match && *file_for_suffix[0].curp == *file_for_suffix[i].curp;
+  while (is_match)
+    {
+      SVN_ERR(decrement_pointers(file_for_suffix_ptr, file_len, pool));
+      
+      reached_prefix = file_for_suffix[0].chunk == suffix_min_chunk0 
+                       && (file_for_suffix[0].curp - file_for_suffix[0].buffer)
+                          == suffix_min_offset0;
+
+      if (reached_prefix || is_one_at_bof(file_for_suffix_ptr, file_len))
+        break;
+      else
+        for (i = 1, is_match = TRUE; i < file_len; i++)
+          is_match =
+            is_match && *file_for_suffix[0].curp == *file_for_suffix[i].curp;
+    }
+
+  /* Slide one byte forward, to point at the first byte of identical suffix */
+  SVN_ERR(increment_pointers(file_for_suffix_ptr, file_len, pool));
+
+  /* Slide forward until we find an eol sequence to add the rest of the line
+     we're in. Then add SUFFIX_LINES_TO_KEEP more lines. Stop if at least 
+     one file reaches its end. */
+  do
+    {
+      while (!is_one_at_eof(file_for_suffix_ptr, file_len)
+             && *file_for_suffix[0].curp != '\n'
+             && *file_for_suffix[0].curp != '\r')
+        SVN_ERR(increment_pointers(file_for_suffix_ptr, file_len, pool));
+
+      /* Slide one or two more bytes, to point past the eol. */
+      if (!is_one_at_eof(file_for_suffix_ptr, file_len)
+          && *file_for_suffix[0].curp == '\r')
+        SVN_ERR(increment_pointers(file_for_suffix_ptr, file_len, pool));
+      if (!is_one_at_eof(file_for_suffix_ptr, file_len)
+          && *file_for_suffix[0].curp == '\n')
+        SVN_ERR(increment_pointers(file_for_suffix_ptr, file_len, pool));
+    }
+  while (!is_one_at_eof(file_for_suffix_ptr, file_len) 
+         && suffix_lines_to_keep--);
+
+  /* Save the final suffix information in the original file_info */
+  for (i = 0; i < file_len; i++)
+    {
+      file[i]->suffix_start_chunk = file_for_suffix[i].chunk;
+      file[i]->suffix_offset_in_chunk = 
+        file_for_suffix[i].curp - file_for_suffix[i].buffer;
+    }
+
+  return SVN_NO_ERROR;
+}
+
+/* Let FILE stand for the array of file_info struct elements of BATON->files
+ * that are indexed by the elements of the DATASOURCE array.
+ * BATON's type is (svn_diff__file_baton_t *).
+ *
+ * For each file in the FILE array, open the file at FILE.path; initialize 
+ * FILE.file, FILE.size, FILE.buffer, FILE.curp and FILE.endp; allocate a 
+ * buffer and read the first chunk.  Then find the prefix and suffix lines
+ * which are identical between all the files.  Return the number of identical
+ * prefix lines in PREFIX_LINES.
+ *
+ * Finding the identical prefix and suffix allows us to exclude those from the
+ * rest of the diff algorithm, which increases performance by reducing the 
+ * problem space.
+ *
+ * Implements svn_diff_fns_t::datasources_open. */
+static svn_error_t *
+datasources_open(void *baton, apr_off_t *prefix_lines,
+                 svn_diff_datasource_e datasource[],
+                 int datasource_len)
+{
+  svn_diff__file_baton_t *file_baton = baton;
+  struct file_info *file[4];
+  apr_finfo_t finfo[4];
+  apr_off_t length[4];
+  svn_boolean_t reached_one_eof = FALSE;
+  int i;
+
+  /* Open datasources and read first chunk */
+  for (i = 0; i < datasource_len; i++)
+    {
+      file[i] = &file_baton->files[datasource_to_index(datasource[i])];
+      SVN_ERR(svn_io_file_open(&file[i]->file, file[i]->path,
+                               APR_READ, APR_OS_DEFAULT, file_baton->pool));
+      SVN_ERR(svn_io_file_info_get(&finfo[i], APR_FINFO_SIZE,
+                                   file[i]->file, file_baton->pool));
+      file[i]->size = finfo[i].size;
+      length[i] = finfo[i].size > CHUNK_SIZE ? CHUNK_SIZE : finfo[i].size;
+      file[i]->buffer = apr_palloc(file_baton->pool, (apr_size_t) length[i]);
+      SVN_ERR(read_chunk(file[i]->file, file[i]->path, file[i]->buffer,
+                         length[i], 0, file_baton->pool));
+      file[i]->endp = file[i]->buffer + length[i];
+      file[i]->curp = file[i]->buffer;
+    }
+
+  for (i = 0; i < datasource_len; i++)
+    if (length[i] == 0)
+      /* There will not be any identical prefix/suffix, so we're done. */
+      return SVN_NO_ERROR;
+
+  SVN_ERR(find_identical_prefix(&reached_one_eof, prefix_lines,
+                                file, datasource_len, file_baton->pool));
+  file_baton->prefix_lines = *prefix_lines;
+
+  if (reached_one_eof)
+    /* At least one file consisted totally of identical prefix, 
+     * so there will be no identical suffix. We're done. */
+    return SVN_NO_ERROR;
+
+  SVN_ERR(find_identical_suffix(file, datasource_len, file_baton->pool));
+
+  return SVN_NO_ERROR;
+}
+
 /* Implements svn_diff_fns_t::datasource_close */
 static svn_error_t *
 datasource_close(void *baton, svn_diff_datasource_e datasource)
@@ -284,6 +672,12 @@ datasource_get_next_token(apr_uint32_t *hash, void
       return SVN_NO_ERROR;
     }
 
+  /* If identical suffix is defined, stop when we encounter it */
+  if (file->suffix_start_chunk || file->suffix_offset_in_chunk)
+    if (file->chunk == file->suffix_start_chunk
+        && (curp - file->buffer) == file->suffix_offset_in_chunk)
+      return SVN_NO_ERROR;
+
   /* Get a new token */
   file_token = file_baton->tokens;
   if (file_token)
@@ -533,6 +927,7 @@ token_discard_all(void *baton)
 static const svn_diff_fns_t svn_diff__file_vtable =
 {
   datasource_open,
+  datasources_open,
   datasource_close,
   datasource_get_next_token,
   token_compare,
Index: subversion/libsvn_diff/diff.h
===================================================================
--- subversion/libsvn_diff/diff.h       (revision 1023075)
+++ subversion/libsvn_diff/diff.h       (working copy)
@@ -91,6 +91,7 @@ typedef enum svn_diff__normalize_state_t
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
+              apr_off_t prefix_lines,
               apr_pool_t *pool);
 
 
@@ -111,6 +112,8 @@ svn_diff__get_tokens(svn_diff__position_t **positi
                      void *diff_baton,
                      const svn_diff_fns_t *vtable,
                      svn_diff_datasource_e datasource,
+                     svn_boolean_t datasource_opened,
+                     apr_off_t prefix_lines,
                      apr_pool_t *pool);
 
 
Index: subversion/libsvn_diff/token.c
===================================================================
--- subversion/libsvn_diff/token.c      (revision 1023075)
+++ subversion/libsvn_diff/token.c      (working copy)
@@ -139,6 +139,8 @@ svn_diff__get_tokens(svn_diff__position_t **positi
                      void *diff_baton,
                      const svn_diff_fns_t *vtable,
                      svn_diff_datasource_e datasource,
+                     svn_boolean_t datasource_opened,
+                     apr_off_t prefix_lines,
                      apr_pool_t *pool)
 {
   svn_diff__position_t *start_position;
@@ -152,10 +154,11 @@ svn_diff__get_tokens(svn_diff__position_t **positi
   *position_list = NULL;
 
 
-  SVN_ERR(vtable->datasource_open(diff_baton, datasource));
+  if (!datasource_opened)
+    SVN_ERR(vtable->datasource_open(diff_baton, datasource));
 
   position_ref = &start_position;
-  offset = 0;
+  offset = prefix_lines;
   hash = 0; /* The callback fn doesn't need to touch it per se */
   while (1)
     {
Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c        (revision 1023075)
+++ subversion/libsvn_diff/lcs.c        (working copy)
@@ -163,6 +163,7 @@ svn_diff__lcs_reverse(svn_diff__lcs_t *lcs)
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
+              apr_off_t prefix_lines,
               apr_pool_t *pool)
 {
   int idx;
@@ -180,9 +181,11 @@ svn_diff__lcs(svn_diff__position_t *position_list1
    */
   lcs = apr_palloc(pool, sizeof(*lcs));
   lcs->position[0] = apr_pcalloc(pool, sizeof(*lcs->position[0]));
-  lcs->position[0]->offset = position_list1 ? position_list1->offset + 1 : 1;
+  lcs->position[0]->offset = position_list1 ? 
+    position_list1->offset + 1 : prefix_lines + 1;
   lcs->position[1] = apr_pcalloc(pool, sizeof(*lcs->position[1]));
-  lcs->position[1]->offset = position_list2 ? position_list2->offset + 1 : 1;
+  lcs->position[1]->offset = position_list2 ?
+    position_list2->offset + 1 : prefix_lines + 1;
   lcs->length = 0;
   lcs->refcount = 1;
   lcs->next = NULL;
Index: subversion/libsvn_diff/diff.c
===================================================================
--- subversion/libsvn_diff/diff.c       (revision 1023075)
+++ subversion/libsvn_diff/diff.c       (working copy)
@@ -43,6 +43,22 @@ svn_diff__diff(svn_diff__lcs_t *lcs,
   svn_diff_t *diff;
   svn_diff_t **diff_ref = &diff;
 
+  if (want_common && (original_start > 1))
+    {
+      /* we have a prefix to skip */
+      (*diff_ref) = apr_palloc(pool, sizeof(**diff_ref));
+
+      (*diff_ref)->type = svn_diff__type_common;
+      (*diff_ref)->original_start = 0;
+      (*diff_ref)->original_length = original_start - 1;
+      (*diff_ref)->modified_start = 0;
+      (*diff_ref)->modified_length = modified_start - 1;
+      (*diff_ref)->latest_start = 0;
+      (*diff_ref)->latest_length = 0;
+
+      diff_ref = &(*diff_ref)->next;
+    }
+
   while (1)
     {
       if (original_start < lcs->position[0]->offset
@@ -105,9 +121,12 @@ svn_diff_diff(svn_diff_t **diff,
 {
   svn_diff__tree_t *tree;
   svn_diff__position_t *position_list[2];
+  svn_diff_datasource_e datasource[] = {svn_diff_datasource_original,
+                                        svn_diff_datasource_modified};
   svn_diff__lcs_t *lcs;
   apr_pool_t *subpool;
   apr_pool_t *treepool;
+  apr_off_t prefix_lines = 0;
 
   *diff = NULL;
 
@@ -116,17 +135,23 @@ svn_diff_diff(svn_diff_t **diff,
 
   svn_diff__tree_create(&tree, treepool);
 
+  SVN_ERR(vtable->datasources_open(diff_baton, &prefix_lines, datasource, 2));
+
   /* Insert the data into the tree */
   SVN_ERR(svn_diff__get_tokens(&position_list[0],
                                tree,
                                diff_baton, vtable,
                                svn_diff_datasource_original,
+                               TRUE,
+                               prefix_lines,
                                subpool));
 
   SVN_ERR(svn_diff__get_tokens(&position_list[1],
                                tree,
                                diff_baton, vtable,
                                svn_diff_datasource_modified,
+                               TRUE,
+                               prefix_lines,
                                subpool));
 
   /* The cool part is that we don't need the tokens anymore.
@@ -139,10 +164,10 @@ svn_diff_diff(svn_diff_t **diff,
   svn_pool_destroy(treepool);
 
   /* Get the lcs */
-  lcs = svn_diff__lcs(position_list[0], position_list[1], subpool);
+  lcs = svn_diff__lcs(position_list[0], position_list[1], prefix_lines, subpool);
 
   /* Produce the diff */
-  *diff = svn_diff__diff(lcs, 1, 1, TRUE, pool);
+  *diff = svn_diff__diff(lcs, prefix_lines + 1, prefix_lines + 1, TRUE, pool);
 
   /* Get rid of all the data we don't have a use for anymore */
   svn_pool_destroy(subpool);
Index: subversion/libsvn_diff/diff3.c
===================================================================
--- subversion/libsvn_diff/diff3.c      (revision 1023075)
+++ subversion/libsvn_diff/diff3.c      (working copy)
@@ -173,7 +173,7 @@ svn_diff__resolve_conflict(svn_diff_t *hunk,
         position[1]->next = start_position[1];
       }
 
-    *lcs_ref = svn_diff__lcs(position[0], position[1],
+    *lcs_ref = svn_diff__lcs(position[0], position[1], 0,
                              subpool);
 
     /* Fix up the EOF lcs element in case one of
@@ -267,18 +267,24 @@ svn_diff_diff3(svn_diff_t **diff,
                                tree,
                                diff_baton, vtable,
                                svn_diff_datasource_original,
+                               FALSE,
+                               0,
                                subpool));
 
   SVN_ERR(svn_diff__get_tokens(&position_list[1],
                                tree,
                                diff_baton, vtable,
                                svn_diff_datasource_modified,
+                               FALSE,
+                               0,
                                subpool));
 
   SVN_ERR(svn_diff__get_tokens(&position_list[2],
                                tree,
                                diff_baton, vtable,
                                svn_diff_datasource_latest,
+                               FALSE,
+                               0,
                                subpool));
 
   /* Get rid of the tokens, we don't need them to calc the diff */
@@ -289,9 +295,9 @@ svn_diff_diff3(svn_diff_t **diff,
   svn_pool_destroy(treepool);
 
   /* Get the lcs for original-modified and original-latest */
-  lcs_om = svn_diff__lcs(position_list[0], position_list[1],
+  lcs_om = svn_diff__lcs(position_list[0], position_list[1], 0,
                          subpool);
-  lcs_ol = svn_diff__lcs(position_list[0], position_list[2],
+  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], 0,
                          subpool);
 
   /* Produce a merged diff */
Index: subversion/libsvn_diff/diff4.c
===================================================================
--- subversion/libsvn_diff/diff4.c      (revision 1023075)
+++ subversion/libsvn_diff/diff4.c      (working copy)
@@ -194,24 +194,32 @@ svn_diff_diff4(svn_diff_t **diff,
                                tree,
                                diff_baton, vtable,
                                svn_diff_datasource_original,
+                               FALSE,
+                               0,
                                subpool2));
 
   SVN_ERR(svn_diff__get_tokens(&position_list[1],
                                tree,
                                diff_baton, vtable,
                                svn_diff_datasource_modified,
+                               FALSE,
+                               0,
                                subpool));
 
   SVN_ERR(svn_diff__get_tokens(&position_list[2],
                                tree,
                                diff_baton, vtable,
                                svn_diff_datasource_latest,
+                               FALSE,
+                               0,
                                subpool));
 
   SVN_ERR(svn_diff__get_tokens(&position_list[3],
                                tree,
                                diff_baton, vtable,
                                svn_diff_datasource_ancestor,
+                               FALSE,
+                               0,
                                subpool2));
 
   /* Get rid of the tokens, we don't need them to calc the diff */
@@ -222,7 +230,7 @@ svn_diff_diff4(svn_diff_t **diff,
   svn_pool_clear(subpool3);
 
   /* Get the lcs for original - latest */
-  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], subpool3);
+  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], 0, subpool3);
   diff_ol = svn_diff__diff(lcs_ol, 1, 1, TRUE, pool);
 
   svn_pool_clear(subpool3);
@@ -243,7 +251,7 @@ svn_diff_diff4(svn_diff_t **diff,
   /* Get the lcs for common ancestor - original
    * Do reverse adjustements
    */
-  lcs_adjust = svn_diff__lcs(position_list[3], position_list[2], subpool3);
+  lcs_adjust = svn_diff__lcs(position_list[3], position_list[2], 0, subpool3);
   diff_adjust = svn_diff__diff(lcs_adjust, 1, 1, FALSE, subpool3);
   adjust_diff(diff_ol, diff_adjust);
 
@@ -252,7 +260,7 @@ svn_diff_diff4(svn_diff_t **diff,
   /* Get the lcs for modified - common ancestor
    * Do forward adjustments
    */
-  lcs_adjust = svn_diff__lcs(position_list[1], position_list[3], subpool3);
+  lcs_adjust = svn_diff__lcs(position_list[1], position_list[3], 0, subpool3);
   diff_adjust = svn_diff__diff(lcs_adjust, 1, 1, FALSE, subpool3);
   adjust_diff(diff_ol, diff_adjust);
