Re: [PATCH] Simplification/speed-up of libsvn_diff by eliminating idx

Stefan Fuhrmann Sun, 29 May 2011 04:28:06 -0700

On 27.05.2011 19:29, Morten Kloster wrote:

On Fri, May 27, 2011 at 3:13 PM, Julian Foad<julian.f...@wandisco.com>  wrote:

On Fri, 2011-05-27 at 13:32 +0200, Johan Corveleyn wrote:

On Fri, May 27, 2011 at 12:47 PM, Fuhrmann Stefan (ETAS/ESA1)
<stefan.fuhrm...@etas.com>  wrote:

Morten Kloster wrote:

[[[
Simpler/faster LCS algorithm in libsvn_diff by elimination of idx.

* subversion/libsvn_diff/lcs.c
  (svn_diff__snake_t): Removed idx parameter (always 0)
  (svn_diff__lcs_t): Removed idx variable (always 0) , let d have either
   sign, and moved the origo of fp by d. New version no longer chooses
   the shorter file as the "original", and can thus give different LCS's
   depending on the order of the input files even when the files have
   different lengths. Speed is ~15% faster for core lcs algorithm.
]]]

Hi Morten,

Thank you very much for the patch. From what I have seen,
it should be correct (haven't applied&  tested it, yet).

The old version of the LCS algorithm started at k=0 and ended at
k=d, where d is the length difference of the two files. d was required
to be positive. It is more efficient to start at k=(-)d and end at 0 (since
ending the loops at 0 allows for faster comparisons), and this also
makes it easy to let d have either sign, which makes the idx variable
unnecessary, further increasing speed by eliminating a parameter
to the frequently called function svn_diff__snake_t. The sign of d
is handled in the initial values of the loops (which have minimal speed
impact).

Looks good.

The new version will more often give different (but equally good) results
depending on the order of the input files.

How does that affect our test suite? Is any of its expected
outputs invalid now? Is the result of blame affected as well
(e.g. if a file shrunk and / or grew in later revisions)?

The patch passes all libsvn_diff tests, at least. After some
thought, I found that the new version will actually improve the
performance of merge: Given files

abcdef  (Base file)
abdce  (Their file)
abdcefg  (My file)

the old version will diff (Their, Base) and (Base, My), since
Their is shorter than Base while My is longer. That means that
for Base vs. Their, the 'd's will be matched, while for Base vs.
My, the 'c's will be matched, yielding a conflict during merge,
even though that part of the files had identical changes in Their
and My! With the new version, the diff order is given strictly by
the calling function, thus it will match the same way in both
cases and avoid conflict (both version produce a conflict at the
end, of course).

In general, the local diff between mostly-matching parts of two
files can no longer be affected by changes in those files far
from the area of interest (except if those changes cause a
global change in LCS), and I'm pretty sure that's all for the good.

The result of blame can be affected by the change, but only
in cases where it essentially was arbitrary to begin with. I
don't see that it can have any significant negative impact.

Ok.

I think the bigger point of you patch is the
"simplification" part rather than the "faster"
part.

I estimated the speed
difference by diffing all of the first 100 revisions of merge.c in
libsvn_client against each other (when diffing only against the previous
revision, the core LCS algorithm usually takes a minor part of the total
running time, so this is not suitable for gauging the LCS algorithm
speed).

The speedup is plausible. Did you test it with 32 bits?

I'm running on a 64-bit machine/OS but using the win32 builds.

I can see that the register pressure is reduced,
particularly under 32 bits. And that may also cause
some compilers to actually inline the *__snake
code.

GCC does the latter anyways and under 64 bits
the gains are under 1% for the *_lcs function.
However, simpler code and more predictable
code often opens the door to more improvements
such as eliminating pointer dereferencing,
re-organizing data to make it more "streamy" etc.

/* k = -1 will be the first to be used to get previous
     * position information from, make sure it holds sane
     * data
     */
-  fp[-1].position[0] = sentinel_position[0].next;
-  fp[-1].position[1] =&sentinel_position[1];
+  fp[d - 1].position[0] = sentinel_position[0].next;
+  fp[d - 1].position[1] =&sentinel_position[1];

Comment and code do not match anymore. Could you
update your patch to fix that? Also, it would be very
nice, if you added more commentary than there is today,
especially the less obvious index handling (why fp[d] will
always be valid etc.)

I've attached an updated version.

As I said, I like your patch and will certainly apply it as
soon as above points have been addressed.

I'm not sure I agree. I'm a bit suspicious about the ~15% speedup.
Where does this come from? I can't imagine that the "faster
comparisons by ending at 0" accounts for this. And even so, maybe
there is another way to achieve that.

15% is more than I expected, but that's what I get. Surprisingly,
none of the speed-up seems to be from the removal of idx; it's all
from the change to the k loops.

Depending on what tool you used to measure
that, you may or may not see whether the code
got inlined there or not. Inlining would be a
perfect explanation for a 15% gain.

I'm suspicious mainly on theoretical grounds: the article on which
this algorithm is based specifically always starts with the shortest
of both files. This is important, because the time complexity of the
algorithm is proportional to P, the number of deleted lines. If you
start with the largest of the two, that number P will usually be
larger than the other way around.

Specifically, the article states: if A (the shortest file) is a
subsequence of B (the longest), the algorithm's complexity will be
linear (O(N)). If you would start with B, I'm pretty sure that won't
be the case.

Concrete example:
A: 12345
B: 41521334254

Going from A ->  B: P==0
Going from B ->  A: P==6

The existing algorithm ensures that, no matter what order you give the
arguments to "diff", the LCS algorithm will always use the direction A
->  B for its work. which should be the most efficient.

Maybe I'm misunderstanding this, or maybe there has been further
research in Computer Science around this (I'm not an academic), but if
so, please explain.

The problem where you start with file A (short) and end with file B) is
completely symmetrical to the problem where you start with file B and
end with file A - all you have to do is swap inserts and deletes. The old
algorithm assumed there would be fewer deletes than inserts, while my
new approach just uses whichever there is fewer of. Note the two loops
for k: The first loop does free inserts until the remaining lengths of the
files are equal, while the second does free deletes until the remaining
lengths are equal. (Each loop can accommodate at most one of the
opposite operation for each value of p, so those are not free.)

Maybe some of the performance improvement (and simplification) can be
achieved simply by calculating idx0 and idx1 once, and then reusing
those variables and pass them on:

     idx0 = length[0]>  length[1] ? 1 : 0;
     idx1 = abs(1 - idx0);

instead of re-calculating abs(1 - idx) everywhere.

Adding a parameter to the calls to svn_diff__snake would as likely increase
the running time. The usage of idx is almost certainly irrelevant anyway,
since new lcs structs are only rarely made during normal diffs.

Why does it use "abs()", I wonder?  That looks redundant if idx can only
be 0 or 1.

I wondered the same thing myself! :-)

I'm not paying much attention or deeply understanding all of this, but
I'm curious so I took a quick look.

Morten Kloster wrote:

* subversion/libsvn_diff/lcs.c
    (svn_diff__snake_t): Removed idx parameter (always 0)
    (svn_diff__lcs_t): Removed idx variable (always 0) , let d have either
     sign, and moved the origo of fp by d.

The functions are called svn_diff__snake and svn_diff__lcs, without the
_t suffix.

Thanks; I was thinking of the names of the structs.

In most of your patch you assume that 'idx' would have been 0, but in
two places you assume it would have been 1.  Is this intentional?  Here:

-  fp += length[idx] + 1;
+  fp += length[1] + 1;
-  d = length[abs(1 - idx)] - length[idx];
+  d = length[0] - length[1];

- Julian

Actually, I still assume that idx is 0, but I need the opposite value from
before: Instead of starting at k=0 and ending at k=d, we now start at k=d
and end at k=0. That means the sign of d must be opposite of what it
was before (d =), and the amount of space needed to the left and right
of origo are switched (fp +=).


Updated log message:
[[[
Simpler/faster LCS algorithm in libsvn_diff by elimination of idx.

* subversion/libsvn_diff/lcs.c
  (svn_diff__snake): Removed idx parameter (always 0)
  (svn_diff__lcs): Removed idx variable (always 0) , let d have either
   sign, and moved the origo of fp by d. New version no longer chooses
   the shorter file as the "original", and can thus give different LCS's
   depending on the order of the input files even when the files have
   different lengths. Speed is ~15% faster for core lcs algorithm.
]]]

Committed as r1128857. I took the liberty to
weaken the speed-up claim.

-- Stefan^2.

Re: [PATCH] Simplification/speed-up of libsvn_diff by eliminating idx

Reply via email to