On 30 Jul 2021, Karl Fogel wrote:
On 30 Jul 2021, Daniel Shahaf wrote:
> What would «svn status» of a modified file without a pristine say?
> How many network/worktree accesses would it involve?
Status would say "modified". The client side still knows the
fingerprint (hash) of the pristine original, naturally.
I should describe this algorithm [1] in full, to be clearer:
Right now, the client side knows the size and the hash of the
pristine base (regardless of whether the pristine base content is
present). But obviously we'd like to avoid computing the hash of
a (say) 1GB working file every time someone runs 'svn status',
because that's slow. At least, computing the hash should be a
last resort.
Fortunately, there is a multi-step method of determining
modification status in which the hash recomputation can be avoided
in most cases:
1) Compare working file size against pristine base size. If the
sizes are different, you're done: the file's status is 'modified'.
If the sizes are the same, proceed to steps below.
2) OPTIONAL: Compute a hash of the first 1MB (or some other short
prefix) of the working file, and compare that to a recorded hash
of the first 1MB of the pristine base. This would require
changing the Subversion client to save that "short-hash" for the
pristine base, at least when we're not storing the pristine base's
content. So there's a little extra work to get this optional
step, but it's not hard.
If no match, then status == 'modified'. If match, then proceed.
2a) ALSO OPTIONAL: There's a fancier variation on (2). You can
save several short-hashes evenly spaced across the file -- for
the sake of example, let's say one near the beginning, one in the
middle, and one near the end. Then to check them, you seek() to
the appropriate places, read just a short amount of data, and
compute each short-hash. Merely seek()'ing over a distance of N
bytes is faster than reading those N bytes of data and computing
a hash on them. (A rough sketch of this variation appears near
the end of this message.)
If no match, then status == 'modified'. If match, then proceed.
3) If and only if all the previous steps have matched, compute
the full hash of the working file and compare it to the recorded
full hash for the pristine base.
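
In code, the above might look something like this rough Python
sketch. All the names here are made up for illustration -- the
real client would presumably do this in C against its own working
copy metadata -- and I use SHA-1 just because that's what the
pristine store already keys on:

    import hashlib
    import os

    SHORT_LEN = 1024 * 1024  # the "first 1MB" prefix from step (2)

    def status_via_hashes(working_path, pristine_size,
                          pristine_short_hash, pristine_full_hash):
        # Step (1): compare sizes; for most binary blobs this settles it.
        if os.path.getsize(working_path) != pristine_size:
            return 'modified'
        with open(working_path, 'rb') as f:
            # Step (2): hash just the first 1MB and compare it to the
            # recorded short-hash of the pristine base.
            short = hashlib.sha1(f.read(SHORT_LEN)).hexdigest()
            if short != pristine_short_hash:
                return 'modified'
            # Step (3): last resort -- rehash the whole working file.
            f.seek(0)
            full = hashlib.sha1()
            for chunk in iter(lambda: f.read(1 << 16), b''):
                full.update(chunk)
        return ('modified' if full.hexdigest() != pristine_full_hash
                else 'unmodified')

Note that every step here reads only the working file and the
locally recorded metadata, so no network access is needed just to
answer the status question.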
With most binary blobs, it's unlikely that even the sizes would be
exactly the same, so the common case is that the method gives a
result in step (1). It's even rarer for the sizes to match *and*
the first 1MB (or whatever length) to match. While one might
occasionally reach step (3), I suspect that would be uncommon even
if we don't implement step (2a). I don't think (2a) would be
worth including in an initial implementation, and I'm not even
sure step (2) itself is necessary in practice.
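
That said, for concreteness, here's roughly what the (2a)
variation could look like (made-up names again, same caveats as
the sketch above). Comparing the returned list against recorded
values would take the place of step (2):

    import hashlib
    import os

    def spaced_short_hashes(path, count=3, span=64 * 1024):
        # Hash COUNT short spans spaced evenly across the file,
        # seek()'ing over the data in between instead of reading it.
        size = os.path.getsize(path)
        hashes = []
        with open(path, 'rb') as f:
            for i in range(count):
                # With count=3: near the beginning, middle, and end.
                offset = max(0, (size - span) * i // max(count - 1, 1))
                f.seek(offset)
                hashes.append(hashlib.sha1(f.read(span)).hexdigest())
        return hashes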
Best regards,
-Karl
[1] This algorithm, minus step (2a), is the one I use in
https://github.com/OpenTechStrategies/ots-tools/blob/master/find-dups .