Hello Julian,
Hello Johan,

On Monday, 22 March 2010, Julian Foad wrote:
...
> My basic idea is an algorithm on the client that takes, as its input, a
> sequence of binary diffs, one for each revision in which the file
> changed, representing that incremental change.
...
> The algorithm applies one of these diffs at a time to an in-memory data
> structure that represents which revision is associated with each byte in
> the file.
...
> Then, at the end, we read in the blamed file's text (just that one
> revision of it), match up the byte ranges to the lines in the file, and
> for each line we display the highest recorded revision number that falls
> within that line.

Please forgive me for intruding with only a little knowledge to share, but
that sounds awful because of the round-trips (and latencies) involved.

I'd suggest trying to keep the (current) version of the file line-wise in
memory (stored as line number, revision, start byte, length, and the actual
bytes); an incoming binary delta would just change the data and the
revision, until the deltas are exhausted. Then the blame output can be
given directly, without any more round-trips. Of course, that means
splitting xdelta operations into line-wise bits; but, as an improvement,
you could also print the source line number of a change (by just "counting"
the lines from top to bottom when applying a delta).
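
To make that a bit more concrete, here is a rough sketch in Python (the
real thing would of course be C in the Subversion client); all the names
are made up, start byte and length are implicit in the stored bytes, and
the delta is simplified to a plain byte-range replacement - turning real
xdelta windows into that form is exactly the "splitting into line-wise
bits" work:

    class Line:
        def __init__(self, rev, data):
            self.rev = rev      # revision that last touched this line
            self.data = data    # the line's bytes, including the trailing '\n'

    def split_lines(rev, data):
        """Split a byte string into Line records, all attributed to 'rev'."""
        return [Line(rev, l) for l in data.splitlines(keepends=True)]

    def apply_delta(lines, rev, offset, old_len, new_bytes):
        """Replace 'old_len' bytes at 'offset'; touched lines get revision 'rev'."""
        # Find the first line overlapping the replaced byte range.
        pos, first = 0, 0
        while first < len(lines) and pos + len(lines[first].data) <= offset:
            pos += len(lines[first].data)
            first += 1
        # Find the line just past the end of the replaced range.
        last, end = first, pos
        while last < len(lines) and end < offset + old_len:
            end += len(lines[last].data)
            last += 1
        if last == first and first < len(lines):
            last += 1           # pure insertion inside an existing line
        # Rebuild the affected region; for simplicity the whole region gets
        # 'rev', even bytes in it that the delta did not actually change.
        region = b"".join(l.data for l in lines[first:last])
        rel = offset - pos
        lines[first:last] = split_lines(
            rev, region[:rel] + new_bytes + region[rel + old_len:])

    def blame(lines):
        for n, line in enumerate(lines, start=1):
            print("r%-5d %4d: %s" % (line.rev, n,
                                     line.data.decode("utf-8", "replace").rstrip("\n")))

For example,

    lines = split_lines(1, b"alpha\nbeta\ngamma\n")
    apply_delta(lines, rev=2, offset=6, old_len=4, new_bytes=b"BETA")
    blame(lines)

prints r1/r2/r1 for the three lines, and the source line numbers come for
free while walking the list, as described above.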
> A backward algorithm would be a little more complex ...

Please see my comment at the end of my mail.

> I am still not completely sure whether the binary diff idea can produce
> a line-wise result that looks natural and minimal, while still being
> efficient. For example, I suspect there may be situations where the
> binary diff will show changes within two different lines where a text
> diff would have shown a change only to one line, but I am not sure.
> Even if effects like this do occur, they might only happen in cases
> where a human reader would agree that that is a good way to describe the
> change anyway.

That might not be true, e.g. in the case of an ignore-whitespace-changes
option or something like that.

> The important first step is to do some profiling to find out how much
> time is being spent -
> - on the server, calculating full texts from the stored diffs?
> - transmitting data over the network?
> - on the client, producing diffs?
> - on the client, producing blame info from diffs?

That's always good advice.

> I've just re-read your message from last year
> <http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2079688>
> in which you report it's mainly client-side CPU bound. That's good to
> know. (That thread is linked from the issue.) So next, you would be well
> advised to profile the execution of the client code and see whether the
> textual diffs or some other part of the processing are taking the most
> time.
>
> The clear advantage of the binary diff algorithm is that on the client
> it neither calculates diffs nor even looks at the older revisions of the
> file's text, so that might be a big win if it works at all.
>
> As regards reversing the current algorithm, first you might want to find
> out what is the oldest revision number that shows up in the blame of
> your typical file, and compare that with all the changes in the file's
> history. Is it only reporting the most recent 1% of the file's changes,
> in which case the reversed algorithm could achieve a 100-fold speed-up,
> or 20%, or 80%?

Well, how about using an iterative approach? Just do the algorithm you
described for the revisions [N-100:N]; if there's any line that's still at
N-100, fetch the blame information for [N-200:N-100] and substitute the
'unknown' lines; repeat as necessary. That sounds simpler than a full
reverse run, and it might be efficient enough - only a few revisions more
than absolutely necessary are fetched and analyzed.

Regards,

Phil
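
PS: in case the iterative idea is too terse, here is the control flow as a
small Python sketch. 'blame_window' is a hypothetical callback standing in
for the existing forward algorithm restricted to a revision range; the step
size of 100 and all the names are made up:

    def iterative_blame(line_count, head_rev, blame_window, step=100):
        """Assemble blame info for all 'line_count' lines of HEAD, newest first.

        blame_window(low, high, unknown) runs the ordinary forward algorithm
        over the revisions (low, high] and returns {line_index: revision} for
        those of the requested HEAD lines it could attribute to a revision
        inside that window.
        """
        blame = {}                  # line index in HEAD -> blamed revision
        high = head_rev
        while len(blame) < line_count and high > 0:
            low = max(high - step, 0)
            unknown = [i for i in range(line_count) if i not in blame]
            blame.update(blame_window(low, high, unknown))
            high = low              # still-unknown lines: step one window back
        return blame

Whether 100 is a sensible window size is something the profiling Julian
suggests would have to answer.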