On Wed, May 18, 2011 at 9:23 PM, Johan Corveleyn <jcor...@gmail.com> wrote:
> On Wed, May 18, 2011 at 7:31 PM, Morten Kloster <mor...@gmail.com> wrote:
>> Thanks, Daniel.
>>
>> Johan:
>> I've found your notes in trunk/notes/diff-optimizations.txt. As you may
>> have realized, my patch is a slightly different approach to the problem
>> you discuss in section I.5, "Discarding non-matching lines before
>> running the LCS (Longest Common Subsequence) algorithm" - rather than
>> discarding the lines before calling LCS, I create count information
>> that lets LCS skip those lines without counting them as (avoidable)
>> changes. I believe my approach is easier to implement (not least now,
>> when it's already done ;) and has less overhead, although the potential
>> speed-up is somewhat less:
>
> Yes, I noticed the similarity :-). Your approach is a nice one, and
> indeed easier to implement. With the discarding approach, we'd have to
> give each line an alternative line number (a "virtual line number", of
> the file without its non-matching lines), then run the LCS on those
> lines, and then go back to using the original line numbers when
> producing the diff output. Or something like that.
>
> But I like your idea too; it sounds more lightweight, and might
> provide most of the benefits.
>
>> LCS has a running time of O( N + P D ), where N is the number of tokens
>> in the longer source, P is the number of mismatches in the shorter
>> source, and D is the total number of mismatches in both sources.
>> Discarding the non-matching tokens would replace both P and D by their
>> discarded versions P' and D', giving a potential speed-up that is the
>> SQUARE of the inverse of the fraction of changed tokens that are not
>> unique to one source, whereas my approach effectively only replaces P
>> by P', not D, so you only get one power, not two (at least I think so;
>> I haven't checked carefully).
>
> You seem to know your stuff :-).
>
> For me it's easier to understand by explaining D as "the length of the
> shortest edit script (additions + deletes) from the larger to the
> smaller source", and P as "the number of deleted tokens in that
> shortest edit script". That's the terminology used in the article
> referred to by lcs.c [1]. Or instead of "shortest edit script", I
> usually use the term "minimal diff".
>
> I'm not sure about your calculation though. I should probably study
> your patch first, because I haven't quite grokked it yet. But could you
> explain why your approach doesn't reduce D (the deleted lines)?
>
> Also: the discarding approach reduces N as well, just by making both
> files smaller. In the "average case", where the expected running time
> is O(N + PD), that's probably not a big deal. But according to the
> article the worst-case running time is O(NP), so reducing N also has a
> big impact on the worst case. Anyway, I'm speaking purely theoretically
> here; I have no idea when this worst-case running time can occur in
> practice.
>
>> There is another possible enhancement that is somewhat complementary
>> to the one I've implemented: if the number of changed tokens is smaller
>> than the number of uniquely matching tokens, and the changes are
>> grouped in well-separated sections, then whenever you have found more
>> unique matches than the number of changes you have "used", you can
>> "restart" the problem from that point: you are guaranteed that there
>> is no better match for the earlier part of either source.
>
> I'm not sure about that. I have to chew on it a little bit :-).
>
> Can you describe this idea in some more detail? Maybe give an example
> or something?
>> This reduces the running time from O( N + P' D ) to something like
>> O( N + sum(P'_n D_n) ), I think, where the sum is over the different
>> changed sections. I haven't worked out the details on how to best
>> implement this approach, but it should work well with my initial
>> patch.
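By the way, just to check that I understand the "count information"
part correctly, here is roughly how I picture it. This is purely
illustrative (my reading of the idea, not code from your patch or from
libsvn_diff), and it assumes each token has already been mapped to a
small int id:

[[[
/* Illustrative sketch only; not the code from the patch or from
 * libsvn_diff.  Tokens are assumed to be mapped to small int ids. */
#include <stdlib.h>

typedef struct token_counts_t
{
  int *in_a;   /* occurrences of each token id in source A */
  int *in_b;   /* occurrences of each token id in source B */
} token_counts_t;

/* Count how often each token id (0 <= id < num_ids) occurs in A and B. */
static token_counts_t *
build_counts(const int *a, int a_len, const int *b, int b_len, int num_ids)
{
  token_counts_t *c = malloc(sizeof(*c));
  int i;

  c->in_a = calloc(num_ids, sizeof(int));
  c->in_b = calloc(num_ids, sizeof(int));
  for (i = 0; i < a_len; i++)
    c->in_a[a[i]]++;
  for (i = 0; i < b_len; i++)
    c->in_b[b[i]]++;
  return c;
}

/* A token of A that never occurs in B can only ever end up as an
 * (unavoidable) change, so the LCS loop may step over a whole run of
 * such tokens instead of spending effort trying to match them. */
static int
skip_unmatchable_in_a(const token_counts_t *c, const int *a, int a_len, int i)
{
  while (i < a_len && c->in_b[a[i]] == 0)
    i++;
  return i;
}
]]]

If that is more or less what the patch does, then every token that is
unique to one source costs only a count lookup during the LCS run,
instead of being explored as a potential match.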
Sorry, forgot the footnote I intended to add:

[1] "An O(NP) Sequence Comparison Algorithm", Sun Wu, Udi Manber, Gene Myers

--
Johan
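P.S.: In case it helps anyone following along, below is a small
self-contained sketch of the algorithm from [1], written straight from
the paper rather than taken from lcs.c, just to make the roles of N, P
and D concrete. It assumes the shorter sequence is 'a' (m <= n) and
again uses plain int token ids; P is the number of deletions from 'a',
the shortest edit script has length D = (n - m) + 2P, and the expected
running time is O(N + PD):

[[[
#include <stdlib.h>

/* Follow diagonal k (k = y - x) as far as the tokens keep matching. */
static int
snake(const int *a, int m, const int *b, int n, int k, int y)
{
  int x = y - k;

  while (x < m && y < n && a[x] == b[y])
    {
      x++;
      y++;
    }
  return y;
}

/* Return P, the minimal number of tokens deleted from the shorter
 * sequence a.  Requires m <= n. */
static int
wu_manber_myers_p(const int *a, int m, const int *b, int n)
{
  int delta = n - m;
  int *buf = malloc((m + n + 3) * sizeof(int));
  int *fp = buf + m + 1;        /* valid indices: -(m+1) .. n+1 */
  int k, p;

  for (k = -(m + 1); k <= n + 1; k++)
    fp[k] = -1;

  for (p = 0; ; p++)
    {
      for (k = -p; k <= delta - 1; k++)
        fp[k] = snake(a, m, b, n, k,
                      fp[k - 1] + 1 > fp[k + 1] ? fp[k - 1] + 1 : fp[k + 1]);
      for (k = delta + p; k >= delta + 1; k--)
        fp[k] = snake(a, m, b, n, k,
                      fp[k - 1] + 1 > fp[k + 1] ? fp[k - 1] + 1 : fp[k + 1]);
      fp[delta] = snake(a, m, b, n, delta,
                        fp[delta - 1] + 1 > fp[delta + 1]
                          ? fp[delta - 1] + 1 : fp[delta + 1]);
      if (fp[delta] == n)
        break;
    }

  free(buf);
  return p;
}
]]]

The worst case is when p has to climb all the way to m, which is where
the O(NP) worst-case bound comes from.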