On Fri, 2011-05-27 at 13:32 +0200, Johan Corveleyn wrote: > On Fri, May 27, 2011 at 12:47 PM, Fuhrmann Stefan (ETAS/ESA1) > <stefan.fuhrm...@etas.com> wrote: > > Morten Kloster wrote: > > > >> [[[ > >> Simpler/faster LCS algorithm in libsvn_diff by elimination of idx. > >> > >>* subversion/libsvn_diff/lcs.c > >> (svn_diff__snake_t): Removed idx parameter (always 0) > >> (svn_diff__lcs_t): Removed idx variable (always 0) , let d have either > >> sign, and moved the origo of fp by d. New version no longer chooses > >> the shorter file as the "original", and can thus give different LCS's > >> depending on the order of the input files even when the files have > >> different lengths. Speed is ~15% faster for core lcs algorithm. > >>]]] > > > > Hi Morten, > > > > Thank you very much for the patch. From what I have seen, > > it should be correct (haven't applied & tested it, yet). > > > >> The old version of the LCS algorithm started at k=0 and ended at > >> k=d, where d is the length difference of the two files. d was required > >> to be positive. It is more efficient to start at k=(-)d and end at 0 (since > >> ending the loops at 0 allows for faster comparisons), and this also > >> makes it easy to let d have either sign, which makes the idx variable > >> unnecessary, further increasing speed by eliminating a parameter > >> to the frequently called function svn_diff__snake_t. The sign of d > >> is handled in the initial values of the loops (which have minimal speed > >> impact). > > > > Looks good. > > > >> The new version will more often give different (but equally good) results > >> depending on the order of the input files. > > > > How does that affect our test suite? Is any of its expected > > outputs invalid now? Is the result of blame affected as well > > (e.g. if a file shrunk and / or grew in later revisions)? > > > >> I estimated the speed > >> difference by diffing all of the first 100 revisions of merge.c in > >> libsvn_client against each other (when diffing only against the previous > >> revision, the core LCS algorithm usually takes a minor part of the total > >> running time, so this is not suitable for gauging the LCS algorithm > >> speed). > > > > The speedup is plausible. Did you test it with 32 bits? > > > >> /* k = -1 will be the first to be used to get previous > >> * position information from, make sure it holds sane > >> * data > >> */ > >> - fp[-1].position[0] = sentinel_position[0].next; > >> - fp[-1].position[1] = &sentinel_position[1]; > >> + fp[d - 1].position[0] = sentinel_position[0].next; > >> + fp[d - 1].position[1] = &sentinel_position[1]; > > > > Comment and code do not match anymore. Could you > > update your patch to fix that? Also, it would be very > > nice, if you added more commentary than there is today, > > especially the less obvious index handling (why fp[d] will > > always be valid etc.) > > > > As I said, I like your patch and will certainly apply it as > > soon as above points have been addressed. > > I'm not sure I agree. I'm a bit suspicious about the ~15% speedup. > Where does this come from? I can't imagine that the "faster > comparisons by ending at 0" accounts for this. And even so, maybe > there is another way to achieve that. > > I'm suspicious mainly on theoretical grounds: the article on which > this algorithm is based specifically always starts with the shortest > of both files. This is important, because the time complexity of the > algorithm is proportional to P, the number of deleted lines. If you > start with the largest of the two, that number P will usually be > larger than the other way around. > > Specifically, the article states: if A (the shortest file) is a > subsequence of B (the longest), the algorithm's complexity will be > linear (O(N)). If you would start with B, I'm pretty sure that won't > be the case. > > Concrete example: > A: 12345 > B: 41521334254 > > Going from A -> B: P==0 > Going from B -> A: P==6 > > The existing algorithm ensures that, no matter what order you give the > arguments to "diff", the LCS algorithm will always use the direction A > -> B for its work. which should be the most efficient. > > Maybe I'm misunderstanding this, or maybe there has been further > research in Computer Science around this (I'm not an academic), but if > so, please explain. > > Maybe some of the performance improvement (and simplification) can be > achieved simply by calculating idx0 and idx1 once, and then reusing > those variables and pass them on: > > idx0 = length[0] > length[1] ? 1 : 0; > idx1 = abs(1 - idx0); > > instead of re-calculating abs(1 - idx) everywhere.
Why does it use "abs()", I wonder? That looks redundant if idx can only be 0 or 1. I'm not paying much attention or deeply understanding all of this, but I'm curious so I took a quick look. Morten Kloster wrote: > * subversion/libsvn_diff/lcs.c > (svn_diff__snake_t): Removed idx parameter (always 0) > (svn_diff__lcs_t): Removed idx variable (always 0) , let d have either > sign, and moved the origo of fp by d. The functions are called svn_diff__snake and svn_diff__lcs, without the _t suffix. In most of your patch you assume that 'idx' would have been 0, but in two places you assume it would have been 1. Is this intentional? Here: > - fp += length[idx] + 1; > + fp += length[1] + 1; > - d = length[abs(1 - idx)] - length[idx]; > + d = length[0] - length[1]; - Julian