On 30 Jul 2021, Karl Fogel wrote:
On 30 Jul 2021, Daniel Shahaf wrote:
> What would «svn status» of a modified file without a pristine say?
> How many network/worktree accesses would it involve?
Status would say "modified". The client side still knows the
fingerprint (hash) of the pristine original, naturally.
I should describe this algorithm [1] in full, to be clearer:
Right now, the client side knows the size and the hash of the
pristine base (regardless of whether the pristine base content is
present). But obviously we'd like to avoid computing the hash of
a (say) 1GB working file every time someone runs 'svn status',
because that's slow. At least, computing the hash should be a
last resort.
Fortunately, there is a multi-step method of determining
modification status in which the hash recomputation can be avoided
in most cases:
1) Compare working file size against pristine base size. If the
sizes are different, you're done: the file's status is 'modified'.
If the sizes are the same, proceed to steps below.
2) OPTIONAL: Compute a hash of the first 1MB (or some other short
prefix) of the working file, and compare that to a recorded hash
of the first 1MB of the pristine base. This would require
changing the Subversion client to save that "short-hash" for the
pristine base, at least when we're not storing the pristine base's
content. So there's a little extra work to get this optional
step, but it's not hard.
If no match, then status == 'modified'. If match, then proceed.
2a) ALSO OPTIONAL: There's a fancier variation on (2). You can
save several short-hashes evenly spaced across the file -- for
the sake of example, let's say one near the beginning, one in the
middle, and one near the end. Then to check them, you seek() to
the appropriate places, read just a short amount of data, and
compute each short-hash. Merely seek()'ing over a distance of N
bytes is faster than reading those N bytes of data and computing
a hash on them. (A rough sketch of this variation appears near
the end of this message.)
If no match, then status == 'modified'. If match, then proceed.
3) If and only if all the previous steps have matched, compute
the full hash of the working file and compare it to the recorded
full hash for the pristine base.
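
In code, the above might look something like this rough Python
sketch. All the names here are made up for illustration -- the
real client would presumably do this in C against its own working
copy metadata -- and I use SHA-1 just because that's what the
pristine store already keys on:

    import hashlib
    import os

    SHORT_LEN = 1024 * 1024  # the "first 1MB" prefix from step (2)

    def status_via_hashes(working_path, pristine_size,
                          pristine_short_hash, pristine_full_hash):
        # Step (1): compare sizes; for most binary blobs this settles it.
        if os.path.getsize(working_path) != pristine_size:
            return 'modified'
        with open(working_path, 'rb') as f:
            # Step (2): hash just the first 1MB and compare it to the
            # recorded short-hash of the pristine base.
            short = hashlib.sha1(f.read(SHORT_LEN)).hexdigest()
            if short != pristine_short_hash:
                return 'modified'
            # Step (3): last resort -- rehash the whole working file.
            f.seek(0)
            full = hashlib.sha1()
            for chunk in iter(lambda: f.read(1 << 16), b''):
                full.update(chunk)
        return ('modified' if full.hexdigest() != pristine_full_hash
                else 'unmodified')

Note that every step here reads only the working file and the
locally recorded metadata, so no network access is needed just to
answer the status question.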
With most binary blobs, it's unlikely that even the sizes would be
exactly the same, so the common case is that the method gives a
result in step (1). It's even rarer for the sizes to match *and*
the first 1MB (or whatever length) to match. While one might
occasionally reach step (3), I suspect that would be uncommon even
if we don't implement step (2a). I don't think (2a) would be
worth including in an initial implementation, and I'm not even
sure step (2) itself is necessary in practice.
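
That said, for concreteness, here's roughly what the (2a)
variation could look like (made-up names again, same caveats as
the sketch above). Comparing the returned list against recorded
values would take the place of step (2):

    import hashlib
    import os

    def spaced_short_hashes(path, count=3, span=64 * 1024):
        # Hash COUNT short spans spaced evenly across the file,
        # seek()'ing over the data in between instead of reading it.
        size = os.path.getsize(path)
        hashes = []
        with open(path, 'rb') as f:
            for i in range(count):
                # With count=3: near the beginning, middle, and end.
                offset = max(0, (size - span) * i // max(count - 1, 1))
                f.seek(offset)
                hashes.append(hashlib.sha1(f.read(span)).hexdigest())
        return hashes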
Best regards,
-Karl
[1] This algorithm, minus step (2a), is the one I use in
https://github.com/OpenTechStrategies/ots-tools/blob/master/find-dups .