On Sat, Apr 16, 2005 at 11:11:00AM -0400, C. Scott Ananian wrote:
> On Sat, 16 Apr 2005, Martin Uecker wrote:
>
> >The right thing (TM) is to switch from SHA1 of compressed
> >content for the complete monolithic file to a merkle hash tree
> >of the uncompressed content. This would make the hash
> >independent of the actual storage method (chunked or not).
>
> It would certainly be nice to change to a hash of the uncompressed
> content, rather than a hash of the compressed content, but it's not
> strictly necessary, since files are fetched all at once: there's not 'read
> subrange' operation on blobs.
>
> I assume 'merkle hash tree' is talking about:
> http://www.open-content.net/specs/draft-jchapweske-thex-02.html
> ..which is very interesting, but not quite what I was thinking.
> The merkle hash approach seems to require fixed chunk boundaries.

I don't know what is written there, but I don't consider
fixed chunk boundaries part of the definition.

> The rsync approach does not use fixed chunk boundaries; this is necessary
> to ensure good storage reuse for the expected case (ie; inserting a single
> line at the start or in the middle of the file, which changes all the
> chunk boundaries).

Yes. The chunk boundaries should be determined deterministically
from local properties of the data. Use a rolling checksum over
some small window and split the file if it hits a special value (0).
This is what the rsyncable patch to zlib does.

> Further, in the absence of subrange reads on blobs, it's not entirely
> clear what using a merkle hash would buy you.

The whole design of git is a hash tree. If you extend this
tree structure into files you end up with merkle hash trees.
Everything else is just more complicated.

Martin

-- 
One night, when little Giana from Milano was fast asleep,
she had a strange dream.
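
A rough sketch of the two ideas above, not taken from git or from
the rsyncable patch: split on a rolling checksum over a small window,
then hash the chunks into a tree. The names split_chunks and
merkle_root, the 16-byte window, the 0x3F mask, the plain byte sum
used as checksum and the pairwise SHA1 tree layout are all assumptions
made for illustration.

```python
import hashlib

WINDOW = 16   # rolling window size in bytes (illustrative assumption)
MASK = 0x3F   # split when (sum & MASK) == 0, so chunks average ~64 bytes


def split_chunks(data: bytes, window: int = WINDOW, mask: int = MASK) -> list:
    """Split data after every position where the sum of the last
    `window` bytes is 0 under `mask`. The decision depends only on
    those bytes, not on the offset within the file."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        if i + 1 >= window and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def merkle_root(chunks: list) -> bytes:
    """Hash each chunk with SHA1, then hash adjacent pairs of digests
    until one digest is left. A lone node at the end of a level is
    hashed on its own (a simplification)."""
    level = [hashlib.sha1(c).digest() for c in chunks]
    if not level:
        return hashlib.sha1(b"").digest()
    while len(level) > 1:
        level = [
            hashlib.sha1(b"".join(level[j:j + 2])).digest()
            for j in range(0, len(level), 2)
        ]
    return level[0]
```

With this scheme, inserting a line at the start of a file changes
only the chunks before the first old boundary; every later chunk comes
out byte-identical, so its hash and stored object can be reused.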