On 30 Jul 2021, Karl Fogel wrote:
On 30 Jul 2021, Daniel Shahaf wrote:
What would «svn status» of a modified file without a pristine say?
How many network/worktree accesses would it involve?

Status would say "modified". The client side still knows the fingerprint (hash) of the pristine original, naturally.

I should describe this algorithm [1] in full, to be clearer:

Right now, the client side knows the size and the hash of the pristine base (regardless of whether the pristine base content is present). But obviously we'd like to avoid computing the hash of a (say) 1GB working file every time someone runs 'svn status', because that's slow. At least, computing the hash should be a last resort.

Fortunately, there is a multi-step method of determining modification status in which the hash recomputation can be avoided in most cases:

1) Compare working file size against pristine base size. If the sizes are different, you're done: the file's status is 'modified'. If the sizes are the same, proceed to steps below.

2) OPTIONAL: Compute the hash of the first 1MB (or some other short prefix) of the working file, and compare that to a recorded hash of the first 1MB of the pristine base. This would require changing the Subversion client to save that "short-hash" for the pristine base, at least when we're not storing the pristine base's content. So there's a little extra work to get this optional step, but it's not hard.

If no match, then status == 'modified'.  If match, then proceed.

2a) ALSO OPTIONAL: There's a fancier variation on (2). You can save N short-hashes at offsets spaced evenly across the file -- for the sake of example, let's say one near the beginning, one in the middle, and one near the end. Then to check them, you seek() to each offset, read just a short amount of data, and compute the short-hash. Merely seek()'ing past a stretch of data is faster than reading those bytes and computing a hash over them.

If no match, then status == 'modified'.  If match, then proceed.

3) If and only if all the previous steps have matched, compute the full hash of the working file and compare it to the recorded full hash for the pristine base.
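To make the control flow concrete, here's a minimal sketch of steps (1)-(3) in Python (the actual Subversion client is C, and the function names, field layout, and choice of SHA-1 here are just illustrative assumptions on my part):

```python
import hashlib
import os

PREFIX_LEN = 1024 * 1024  # length of the "short-hash" prefix for step (2)

def sha1_of_file(path):
    """Full-file SHA-1 (step 3) -- the expensive last resort."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def sha1_of_prefix(path, length=PREFIX_LEN):
    """SHA-1 of the first LENGTH bytes of the file (step 2)."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(length)).hexdigest()

def is_modified(working_path, pristine_size, pristine_hash,
                pristine_prefix_hash=None):
    """Return True if WORKING_PATH differs from the recorded pristine base.

    PRISTINE_SIZE and PRISTINE_HASH are the recorded size and full hash
    of the pristine base; PRISTINE_PREFIX_HASH is the optional recorded
    short-hash of its first PREFIX_LEN bytes.
    """
    # Step 1: compare sizes -- settles the common case immediately.
    if os.path.getsize(working_path) != pristine_size:
        return True
    # Step 2 (optional): compare the short-hash of the prefix.
    if pristine_prefix_hash is not None:
        if sha1_of_prefix(working_path) != pristine_prefix_hash:
            return True
    # Step 3: sizes (and prefix) match; fall back to the full hash.
    return sha1_of_file(working_path) != pristine_hash
```

Note that no step ever reads more of the working file than it has to: step (1) is a stat(), step (2) reads at most 1MB, and only step (3) reads the whole file.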

With most binary blobs, it's unlikely that even the sizes would be exactly the same, so the common case is that the method gives a result in step (1). It's even rarer for the sizes to match *and* the first 1MB (or whatever length) to match. While one might occasionally reach step (3), I suspect that would be uncommon even if we don't implement step (2a). I don't think it would be worth implementing (2a) in an initial implementation, and I'm not even sure step (2) itself is necessary in practice.
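For completeness, the sampling in step (2a) could look something like the following Python sketch (again hypothetical -- the sample length, count, and SHA-1 are arbitrary choices for illustration):

```python
import hashlib
import os

SAMPLE_LEN = 64 * 1024  # bytes hashed at each sampled offset

def sampled_short_hashes(path, num_samples=3, sample_len=SAMPLE_LEN):
    """Short-hashes at NUM_SAMPLES offsets evenly spaced across PATH.

    seek()'ing between samples skips over the intervening data without
    reading it, which is what makes this cheaper than hashing the
    whole file.
    """
    size = os.path.getsize(path)
    hashes = []
    with open(path, "rb") as f:
        for i in range(num_samples):
            # Spread offsets from the start of the file toward the end.
            offset = (size - sample_len) * i // max(num_samples - 1, 1)
            f.seek(max(offset, 0))
            hashes.append(hashlib.sha1(f.read(sample_len)).hexdigest())
    return hashes
```

The recorded list of short-hashes would be compared element-by-element against the working file's, declaring 'modified' on the first mismatch.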

Best regards,
-Karl

[1] This algorithm, minus step (2a), is the one I use in
https://github.com/OpenTechStrategies/ots-tools/blob/master/find-dups .
