Hello Stefan,

On Mittwoch, 3. März 2010, Stefan Sperling wrote:
> A block is "just another pristine".
> So a block can happen to also serve as a pristine for a different 10MB
> file which happens to have the same content as the block.
> We set the block size to something fixed, like 10MB.
> The last block is allowed to be smaller and runs till EOF.
You could go a little step further and determine the block borders by a 
synchronizing checksum, ie. a manber-hash.

FSVS does this; the advantages are
a) you can quit comparing the files (or hashs) as soon as a different block
   is found
b) sending nearly-optimal deltas is trivial, because you know that the file
   originally had the blocks A,B,C,D and E, and now the blocks hash to
   A,X,C,Y and E - so you can just tell the server "the first length(A) bytes 
   are identical, then you need X, then length(C) are unchanged, ..."
   Of course, that's not byte-optimized, but AFAIK the server will re-compute
   the delta anyway (for BDB and FSFS).
c) identical blocks from different files get shared; so keeping 3 revisions of
   a 100MB file might need only 100 + epsilon space, and that's without having
   the blocks compressed.
d) small point: the manber-CRC provide a few bits more security that there's
   no hash-collision.

So on commit the content verification goes once through the file, and records
offsets and length of the blocks (and whether they did exist in the older 
version); for transferring the delta only some seeks and writes are needed.

The manber-hash block size can be chosen arbitrarily; in FSVS I'm using 
~128kB, which means that small files just have a single hash to store, but for 
larger ones the IO savings get realized.



Regards,

Phil

Reply via email to