> 0)
> use of a rolling checksum enables decent block-level deduplication on
> files that are modified in the middle; some info:
> http://svana.org/kleptog/rgzip.html
> http://blog.kodekabuki.com/post/11135148692/rsync-internals
>
> in short, a rolling checksum is used to find reasonable restart
> points; for us, block boundaries. probably could be overlaid over
> Venti; rollingchecksumfs, anybody?
>
> 1)
> Git uses a diff-based format for long-term compacted storage, plus
> some gzip compression. i don't know the specifics, but IIRC it's
> pretty much standard diff.
>
> it's fairly CPU- and memory-intensive on larger (10...120MB in my
> case) text files, but produces a beautiful result:
>
> i have a cronjob take a dump of a dozen MySQL databases, each some
> 10...120MB of SQL (textual). each daily dump collection is committed
> into Git; the overall daily collection size grew from some 10MB two
> years ago to about 410MB today; over two years, about 700 commits.
>
> each dump differs slightly in content from yesterday's, and the
> changes are scattered all over the files, so it would not deduplicate
> well at the block level.
>
> yet the Git storage, after compaction (which takes a few minutes on a
> slow desktop), totals about 200MB, all commits included. yep: two
> years' worth of Git history takes less storage than one daily dump.
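a minimal sketch, in C, of the rolling-checksum chunking described in
point 0) above. the adler-style sum is the same shape as rsync's
rolling checksum, but the window size (48 bytes) and the 13-bit
boundary mask are my own illustrative assumptions, not rsync's or
anybody else's actual parameters. a block boundary is declared
wherever the low bits of the rolling sum are all zero, so boundaries
depend only on nearby content and re-synchronize after an edit in the
middle of a file.

#include <stdio.h>
#include <stdlib.h>

enum {
	Win  = 48,		/* sliding window size (assumed) */
	Mask = (1<<13) - 1	/* low 13 bits zero => boundary; ~8K average blocks */
};

/* print the byte ranges of the content-defined blocks in buf[0..n) */
void
chunk(unsigned char *buf, long n)
{
	unsigned a, b;
	long i, start;

	a = b = 0;
	start = 0;
	for(i = 0; i < n; i++){
		if(i < Win){
			/* still filling the first window */
			a += buf[i];
			b += a;
		}else{
			/* roll: admit buf[i], retire buf[i-Win] */
			a += buf[i] - buf[i-Win];
			b += a - Win*buf[i-Win];
		}
		/* restart point: checksum hits the magic bit pattern */
		if(i >= Win-1 && (b & Mask) == 0){
			printf("block %ld..%ld\n", start, i);
			start = i+1;
		}
	}
	if(start < n)
		printf("block %ld..%ld\n", start, n-1);
}

int
main(int argc, char **argv)
{
	FILE *f;
	unsigned char *buf;
	long n;

	if(argc != 2 || (f = fopen(argv[1], "rb")) == NULL){
		fprintf(stderr, "usage: chunk file\n");
		return 1;
	}
	fseek(f, 0, SEEK_END);
	n = ftell(f);
	rewind(f);
	buf = malloc(n);
	if(buf == NULL || fread(buf, 1, n, f) != (size_t)n)
		return 1;
	chunk(buf, n);
	return 0;
}

because a boundary depends only on the last few dozen bytes, an insert
in the middle of a file disturbs at most the blocks around it; from
the next restart point on, the same blocks fall out again and could be
deduplicated against an existing store such as Venti.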
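for point 1), what git stores long-term is indeed delta-based: packed
objects are kept as copy/insert instructions against a base object,
and the result is deflate-compressed. the toy below shows only the
copy/insert idea and invents its own opcode struct; it is not git's
real pack encoding.

#include <stdio.h>
#include <string.h>

enum { Copy = 'C', Insert = 'I' };

typedef struct Op Op;
struct Op {
	int	kind;
	long	off;	/* Copy: offset into the old version */
	long	len;
	char	*data;	/* Insert: literal new bytes */
};

/* rebuild the new version from the old one plus a delta; returns its length */
long
apply(char *old, Op *delta, int nop, char *out)
{
	long n;
	int i;

	n = 0;
	for(i = 0; i < nop; i++){
		if(delta[i].kind == Copy)
			memmove(out+n, old+delta[i].off, delta[i].len);
		else
			memmove(out+n, delta[i].data, delta[i].len);
		n += delta[i].len;
	}
	return n;
}

int
main(void)
{
	/* yesterday's dump line, and today's stored as a delta against it */
	char *old = "INSERT INTO users VALUES (1, 'ann');\n";
	Op delta[] = {
		{ Copy, 0, 26, NULL },		/* unchanged prefix */
		{ Insert, 0, 9, "2, 'bob')" },	/* the edited span */
		{ Copy, 35, 2, NULL }		/* unchanged ";\n" suffix */
	};
	char out[128];
	long n;

	n = apply(old, delta, 3, out);
	printf("%.*s", (int)n, out);	/* INSERT INTO users VALUES (2, 'bob'); */
	return 0;
}

this is why the daily dumps compact so well even though the changes
are scattered: each commit costs roughly the size of its insert
instructions, not the size of the changed files.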
given that most disks are very large, and most people's non-media
storage requirements are very small, why is this compelling?  from
what i've seen, people have the following requirements for storage:

1. speed
2. speed
3. speed
4. large caches.

- erik