On Tue, Jun 14, 2011 at 11:12 PM, Johan Corveleyn <jcor...@gmail.com> wrote:
> On Tue, Jun 14, 2011 at 7:32 PM, Morten Kloster <mor...@gmail.com> wrote:
>> On Tue, Jun 14, 2011 at 2:49 AM, Johan Corveleyn <jcor...@gmail.com> wrote:
>>> On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <mor...@gmail.com> wrote:
>>>>
>> []
>>>> Here's an updated patch based on the current HEAD, and with
>>>> one or two minor improvements. While, as stated, the patch
>>>> can give many-fold speed increases for ideal cases, it also slows
>>>> down the LCS algorithm significantly for poorly matching files; I
>>>> got about 18% increase in running time on my system for files
>>>> that were very poor matches but had no lines unique to either file.
>>>
>>> Hi Morten,
>>>
>>> Hmmm, that sounds like a lot of overhead in some cases (I actually
>>> wonder why -- I haven't fully understood the implementation, but on
>>> first sight it doesn't seem that computationally intensive). But the
>>> most important question that's bugging me: do those "ideal cases"
>>> occur often in real-world examples?
>>>
>>
>> It's not surprising to get that kind of speed reduction, since it uses
>> an extra test within the loops that take the bulk of the time when
>> the files don't match.
>
> Indeed, that might be it :-). Just wondering though: it isn't by any
> chance the case that, worst case, clear_fp is called too often (more
> often than is useful, each time resetting the LCS run for only
> marginal gain)?
>
> Can you verify that those 18% are spent because of those extra
> comparisons, and not because of calls to clear_fp? IIUC, with those
> poorly-matching files, there should never be a call to clear_fp,
> because the unique-matching optimization never kicks in, right?
>

As you say, clear_fp is never called in this case. But there is
actually a risk of slowdown for well-matching files due to calls to
clear_fp (during the loop over k for p=0; clear_fp will clear more
than it really needs to). It shouldn't be too hard to avoid that,
though.

> Also, those 18% you measured: is that the increase in
> svn_diff_file_diff time, or in lcs time? Or is that approximately the
> same, because the lcs time dominates because of the poor matching?
>

They are essentially the same in this case.

>
> I have two more, possibly stupid, questions on the implementation in the
> patch:
>
> - You mentioned in your theoretical explanation that you still have to
> scan the entire d-range for every p, even after you've found a "reset
> point". But can't you just "break" from the for-loops, whenever you
> encounter a reset point?
>

No, I don't have to scan after I find a reset point (the loops just
continue in post-reset mode; there's no waste there). The issue is
that svn_diff__lcs still looks for a p=0 solution even after a unique
possible match has failed, which clearly is no longer possible (it's
not restricted to p=0, that's just the simplest case).

> - You calculated 'limit' as:
> limit = (d >= 0 ? 2 * p + d : 2 * p - d);
> and made the "reset condition" be:
> if (fp[k].uniquematchcount > limit + k) /* with k < 0 */
> or:
> if (fp[k].uniquematchcount > limit - k) /* with k >= 0 */
>
> I don't fully understand. Can you explain a bit more why this is the
> right value for 'limit', and how this is the correct "reset
> condition"?
>

Those limits are actually overkill. The correct condition is that the
number of unique matches must be larger than (or equal to, I guess)
the number of mismatches that could conceivably be corrected by
matching the files otherwise. I have here used the total number of
mismatches in BOTH files, but the larger number of mismatches in one
file would be sufficient.
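
To spell that out, here is roughly the check I mean, pulled out into a
stand-alone sketch (invented function name and plain ints, just for
illustration; in the patch the test is written inline against fp[k],
as quoted above):

  /* Sketch only, not the code from the patch.  d is the length
   * difference between the two files and p the iteration counter of
   * the outer loop in svn_diff__lcs, so limit is the total number of
   * mismatched lines in both files so far - the "overkill" bound
   * described above. */
  static int
  reset_is_safe(int uniquematchcount, int p, int d, int k)
  {
    int limit = (d >= 0 ? 2 * p + d : 2 * p - d);

    /* The pre-reset state can be dropped once the number of unique
     * matches exceeds the number of mismatches that could conceivably
     * be corrected by matching the files otherwise; the +/- k is just
     * the per-diagonal form of the bound, exactly as in the condition
     * you quoted. */
    return (k < 0 ? uniquematchcount > limit + k
                  : uniquematchcount > limit - k);
  }

Using the tighter bound mentioned above (the larger of the two files'
mismatch counts) would only mean computing a smaller limit; the shape
of the check stays the same.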

>>> I'm sure you could craft good examples, but how likely am I to have a
>>> benefit of this in my average repository (given all of the other
>>> optimizations taking away already a lot of the work (prefix/suffix
>>> elimination; elimination of non-matching lines due to token-counting;
>>> ...))?
>>>
>>> Can you give examples from subversion's own repository, or other
>>> public repositories, where it can have a big impact?
>>>
>>
>> I ran it on merge.c (in libsvn_client), between revisions 1127198 (HEAD,
>> or close enough) and 1036419 (semi-randomly chosen to have about
>> the right fraction of changes from 1127198 - it was the first one I looked
>> at, and seemed reasonable for the test). When running all of
>> svn_diff_file_diff, the patch was about 15% faster, however, the LCS
>> algorithm itself was about 3.5 times faster; most of the time was
>> spent reading the tokens from file.
>
> That's quite neat! It's a lot more than I expected in an average case.
> Maybe some more testing should be done (not saying you have to do this
> per se, just in general) to see how average this is :-). But for now
> it's already great to see a single real-world example like the one you
> just gave.
>
>> The "problem" is that the files have to be quite big before the LCS
>> algorithm itself dominates the running time even when the files still
>> match reasonably well, whereas it is much easier for the LCS
>> algorithm to dominate the running time when the files don't match
>> at all, in which case this patch degrades performance.
>
> Yeah, I know :-). The token-scanning/hashing phase is still quite a
> heavy factor in the diff performance (especially when the matching of
> the files is reasonably good (yielding a fast LCS-run), which is
> typically the case with changes in source code).
>
> Stefan Fuhrmann (stefan2) has some great ideas for optimizing the
> token-scanning phase, which could help a lot for that. We had some
> nice discussions about it during the Berlin Hackathon last month (but
> I didn't have the chance yet to dig into it further). But that's all
> definitely 1.8 or later stuff.
>
> Also, in the same vein, I think this current patch will have to wait
> for post-1.7. I like the idea a lot, and would definitely like to see
> it in svn one day, but it's not such a trivial change. With 1.7 just
> around the corner, we're trying to avoid changes that are not directly
> related to getting 1.7 (finally) out the door :-).
>
>> It is quite easy to do a quick heuristic test for whether this patch
>> would help or not, so we could have alternate versions of the LCS
>> algorithm depending on whether that test checks out. That would
>> make the code somewhat harder to maintain, of course.
>
> That might be a good idea, to get really the best of both worlds.
> Though I think it would also be acceptable to optimize for the common
> case (reasonably matching files (modulo all the "non-matching
> lines")), at a slight cost for the rare case (files that are matching
> terribly bad, even with "skipping all the non-matching lines"). Not
> sure right now, I guess it depends on the cost of the added code
> complexity.
>
> Definitely something to look at post-1.7 :-).
>
>>> Or are you looking at this for some specific purpose, some repository
>>> where this situation occurs regularly and has a big impact (huge
>>> LCS'es ...)?
>>>
>>
>> Nope, I just like to optimize stuff, and since we were already
>> looking at diff optimization I thought I might as well point out the
>> possibility.
>
> Cool :-). Me too.
>
> --
> Johan
>
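
PS: Since the heuristic-test idea will probably come up again post-1.7,
this is the kind of check I have in mind - purely a sketch with invented
names, not part of the current patch; the counts would come out of the
existing token-counting step:

  /* Decide up front which LCS variant to run: only enable the extra
   * unique-match bookkeeping when enough lines occur exactly once in
   * each file for the early reset to have a realistic chance of
   * paying off.  The threshold is a placeholder and would need tuning. */
  static int
  prefer_unique_match_lcs(long unique_matches, long total_lines)
  {
    return total_lines > 0 && unique_matches * 4 > total_lines;
  }

If the test fails, we would simply run the current LCS code unchanged
and avoid the extra per-iteration comparisons entirely.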