Re: A two-part vision for Subversion and large binary objects.

Daniel Shahaf Tue, 08 Mar 2022 05:17:52 -0800

Karl Fogel wrote on Mon, Mar 07, 2022 at 13:44:03 -0600:
> On 07 Mar 2022, Mark Phippard wrote:
> > > I do understand the reasons why Evgeny thought pre-fetching
> > > pristines for modified files as part of an 'update' could be a
> > > good idea.
> > 
> > My recollection of the first version of this patch, commit needed the
> > pristine and so had to fetch it before the commit happened. This may
> > have been a reason it seemed like a good idea at the time for update
> > to get the pristine.
> 
> Ah, maybe so; I didn't realize that.
> 
> If that was the motivation, then there's even less reason for 'update' to
> fetch pristines for modified files.  Having the pristine is not only
> unnecessary for the commit, in most cases having the pristine is not even
> particularly *useful* to the commit.  These types of files tend to be
> non-diffable anyway (i.e., not even binary diffable), broadly speaking and
> with occasional exceptions of course.  For example, a common such file is a
> gigantic gzipped blob.  Tiny changes in the uncompressed text will lead to a
> completely different gzipped blob.


And «update» could send a self-compressed delta anyway.

> (I suppose it might be the case that if the first change is made very late
> in the uncompressed text, then the revised gzipped blob can, under some
> real-world circumstances, actually be bit-for-bit the same as the original
> for a long initial prefix before showing any difference.  But this is a rare
> enough case that I don't think Subversion should be trying to detect it and
> support it.  We'd essentially have to incorporate the rsync rolling-checksum
> algorithm, or something like it, into our diff negotiation to even get any
> advantage.)

This use-case may be a rare one, but rsync _was_ in fact designed to
solve precisely the problem that «svn commit» of a pristineless file
needs to solve.  So, suppose we did use the rsync algorithm, would this
benefit any other use-cases other than the "first change is at the end
of the file" use-case you describe here?  Is it faster to commit a file
by sending a self-delta of it or by rsync'ing it?

Furthermore, the user may be able to deliberately create the huge file
in a way that makes it rsync-friendly: for instance, `svnadmin dump`
emits hashes in sorted order, which has the side-effect of making dump
files rsync-friendly.  For gzip files there is «gzip --rsyncable».

None of this is needed for the MVP, of course, but I do think the basic
principle of using rsync is in fact sound.

An alternative is to require the user to let svn know before they're
starting to edit a file, so we can create a pristine off the on-disk
file.  This way we won't have pristineless modified files in the first
place.

> And in the absence of fancy cross-network common-prefix detection code that
> we're not going to write, this would just be cost-shifting anyway.  Whatever
> commit-time improvement one would gain from having the pristine locally
> would be offset by the extra time spent fetching the pristine to make that
> commit-time improvement possible.

What assumptions is this conclusion valid under?  It seems to this
conclusion assumes, at least, that the uplink and downlink bandwidths
are equal and that the pristine will be needed exactly once (i.e.,
a hydrate-commit-dehydrate sequence).

I'm not objecting to making assumptions; we aren't going to address all
use-cases in 1.15.  I'm just asking that we make our assumptions explicit.

Cheers,

Daniel

> So... yeah.  Let's not do that :-).
> 
> Best regards,
> -Karl

Re: A two-part vision for Subversion and large binary objects.

Reply via email to