Hello Julian,
Hello Johan,

On Montag, 22. März 2010, Julian Foad wrote:
...
> My basic idea is an algorithm on the client that takes, as its input, a
> sequence of binary diffs, one for each revision in which the file
> changed, representing that incremental change.
...
> The algorithm applies one of these diffs at a time to an in-memory data
> structure that represents which revision is associated with each byte in
> the file.
...
> Then, at the end, we read in the blamed file's text (just that one
> revision of it), match up the byte ranges to the lines in the file, and
> for each line we display the highest recorded revision number that falls
> within that line.
Please forgive me for intruding with only a little knowledge to share, but that 
sounds awful because of the round-trips (and latencies) involved.

I'd suggest trying to keep the (current) version of the file line-wise in 
memory (stored as line number, revision, start byte, length, and the actual 
bytes); an incoming binary delta would then just change the data and the 
revision, until the deltas are exhausted.

Then the blame output can be given directly, without any more round-trips.

Of course, that means splitting the xdelta operations into line-wise pieces; 
but as an improvement you could also print the source line number of a change 
(by simply counting the lines from top to bottom while applying a delta).

> A backward algorithm would be a little more complex ...
Please see my comment at the end of my mail.

> I am still not completely sure whether the binary diff idea can produce
> a line-wise result that looks natural and minimal, while still being
> efficient.  For example, I suspect there may be situations where the
> binary diff will show changes within two different lines where a text
> diff would have showed a change only to one line, but I am not sure.
> Even if effects like this do occur, they might only happen in cases
> where a human reader would agree that that is a good way to describe the
> change anyway.
That might not hold in the presence of an ignore-whitespace-changes option or 
something like that: a commit that only re-indents a line still shows up in 
the binary diff, while a whitespace-insensitive text diff would skip it.
 
> The important first step is to do some profiling to find out how much
> time is being spent -
>   - on the server, calculating full texts from the stored diffs?
>   - transmitting data over the network?
>   - on the client, producing diffs?
>   - on the client, producing blame info from diffs?
That's always good advice.
 
> I've just re-read your message from last year
> <http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=
> 2079688> in which you report it's mainly client-side CPU bound.  That's
>  good to know.  (That thread is linked from the issue.)  So next, you would
>  be well advised to profile the execution of the client code and see
>  whether the textual diffs or some other part of the processing are taking
>  the most time.
> 
> The clear advantage of the binary diff algorithm is that on the client
> it neither calculates diffs nor even looks at the older revisions of the
> file's text, so that might be a big win if it works at all.
> 
> As regards reversing the current algorithm, first you might want to find
> out what is the oldest revision number that shows up in the blame of
> your typical file, and compare that with all the changes in the file's
> history.  Is it only reporting the most recent 1% of the file's changes,
> in which case the reversed algorithm could achieve a 100-fold speed-up,
> or 20%, or 80%?
Well, how about using an iterative approach? Just run the algorithm you 
described for the revisions [N-100:N]; if any line is still attributed to 
N-100, fetch the blame information for [N-200:N-100] and substitute the 
'unknown' lines; repeat as necessary.

That sounds simpler than a full reverse run, and might be efficient enough: 
only a few revisions more than absolutely necessary are fetched and analyzed.
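
A sketch of that loop, again just illustrative Python: blame_range is a 
hypothetical helper that runs the forward algorithm over [start:end] and 
returns one entry per line of the head text, either the revision it found or 
None for lines that were already present at `start`; the chunk size of 100 and 
the line tracking across chunk boundaries are glossed over here.

def iterative_blame(blame_range, path, head, chunk=100):
    """Repeat the forward blame over ever older revision ranges until
    every line of the head text has been attributed to a revision."""
    result = None
    end = head
    while end > 0:
        start = max(end - chunk, 0)
        piece = blame_range(path, start, end)
        if result is None:
            result = piece
        else:
            # Substitute only the lines that are still unknown.
            result = [r if r is not None else p
                      for r, p in zip(result, piece)]
        if all(r is not None for r in result):
            break      # every line attributed; no need to go further back
        end = start    # otherwise fetch and analyze the next older chunk
    return result

That way the client stops as soon as the newest chunks explain all the lines, 
which matches the "most recent 1% of the file's changes" case Julian mentions 
above.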


Regards,

Phil
