Re: [PATCH] Speed-up of libsvn_diff using token counts

Morten Kloster Tue, 24 May 2011 13:23:27 -0700

Johan / Stefan - any progress on the reviewing part?

Only possible error I'm aware of is that I made the indices/counts
32 bit signed/unsigned integers, which means it'll run into trouble
if there are more than ~2 billion different tokens (lines) or more than
~4 billion occurrences of the same token. If the library should be
able to handle such files, they should be made 64 bit (for the
indices, it should be safe to keep them 32 bit on 32 bit machines,
as those could never have the memory to run such diffs anyway).



I have a couple of other changes ready:

One is a minor simplification, by getting rid of the idx parameter in
lcs.c - there is a simpler way to accomplish what it's for.

The other change is a partial implementation of the other possible
enhancement I mentioned earlier, which takes advantage of unique
matches to "restart" the lcs algorithm whenever possible, thus
reducing the effect of the square computational complexity. The
"partial" part is that this version is only really effective if the two
files have roughly the same number of (non-unique) lines; it will not
reduce the (P * length difference) part of the complexity. For good
cases, though, it can reduce the time required by the lcs algorithm
by large factors (roughly the number of well-separated sections of
difference, if they are all about the same size).

The question is how I should post these changes. The idx change
is relatively independent of the others; should I post a patch from
current HEAD revision for that change alone, should I post it in this
thread or a new one, and should I specify how that patch interacts
with the other patches (and if so, where)? The other patch is based
on my original patch; should I post a patch of the additional
changes or a cumulative patch from current HEAD revision?

Re: [PATCH] Speed-up of libsvn_diff using token counts

Reply via email to