Hello Stefan, On Mittwoch, 3. März 2010, Stefan Sperling wrote: > A block is "just another pristine". > So a block can happen to also serve as a pristine for a different 10MB > file which happens to have the same content as the block. > We set the block size to something fixed, like 10MB. > The last block is allowed to be smaller and runs till EOF. You could go a little step further and determine the block borders by a synchronizing checksum, ie. a manber-hash.
FSVS does this; the advantages are a) you can quit comparing the files (or hashs) as soon as a different block is found b) sending nearly-optimal deltas is trivial, because you know that the file originally had the blocks A,B,C,D and E, and now the blocks hash to A,X,C,Y and E - so you can just tell the server "the first length(A) bytes are identical, then you need X, then length(C) are unchanged, ..." Of course, that's not byte-optimized, but AFAIK the server will re-compute the delta anyway (for BDB and FSFS). c) identical blocks from different files get shared; so keeping 3 revisions of a 100MB file might need only 100 + epsilon space, and that's without having the blocks compressed. d) small point: the manber-CRC provide a few bits more security that there's no hash-collision. So on commit the content verification goes once through the file, and records offsets and length of the blocks (and whether they did exist in the older version); for transferring the delta only some seeks and writes are needed. The manber-hash block size can be chosen arbitrarily; in FSVS I'm using ~128kB, which means that small files just have a single hash to store, but for larger ones the IO savings get realized. Regards, Phil