> 0)
> use of a rolling checksum enables decent block-level deduplication on
> files that are modified in the middle; some info:
> http://svana.org/kleptog/rgzip.html
> http://blog.kodekabuki.com/post/11135148692/rsync-internals
> 
> in short, a rolling checksum is used to find reasonable restart points;
> for us, block boundaries. probably could be overlaid on top of Venti;
> rollingchecksumfs anybody?
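> 
> a minimal sketch of the idea (hypothetical code, not Venti's or
> rsync's): hash a small sliding window of bytes and declare a block
> boundary whenever the low bits of the hash are zero, so boundaries
> stick to content rather than to fixed offsets:
> 
> 	/* content-defined chunking via a polynomial rolling hash */
> 	#include <stdio.h>
> 	#include <stdint.h>
> 
> 	enum {
> 		Window = 48,		/* bytes in the rolling window */
> 		Mask = (1<<13) - 1	/* ~8KB average block size */
> 	};
> 
> 	void
> 	chunk(FILE *f)
> 	{
> 		uint32_t hash = 0, pow = 1;
> 		uint8_t win[Window] = {0};
> 		long off = 0, start = 0;
> 		int i = 0, c;
> 
> 		for (c = 0; c < Window; c++)
> 			pow *= 31;	/* pow = 31^Window, to drop the outgoing byte */
> 		while ((c = getc(f)) != EOF) {
> 			hash = hash*31 + c - pow*win[i];	/* slide window by one byte */
> 			win[i] = c;
> 			i = (i+1) % Window;
> 			off++;
> 			if ((hash & Mask) == 0) {	/* content-defined boundary */
> 				printf("block %ld..%ld\n", start, off);
> 				start = off;
> 			}
> 		}
> 		if (off > start)
> 			printf("block %ld..%ld\n", start, off);
> 	}
> 
> an insertion early in the file shifts everything after it, but the
> hash resynchronizes at the next boundary, so the later blocks still
> come out identical and dedup against the old copies.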
> 
> 1)
> Git uses a diff-based format for long-term compacted storage, plus some
> gzip compression. i don't know the specifics, but IIRC it's pretty much
> standard diff.
> 
> it's fairly CPU- and memory-intensive on larger (10...120MB in my case)
> text files, but produces beautiful results:
> 
> i have a cronjob take a dump of a dozen MySQL databases, each some
> 10...120MB of SQL (textual). each daily dump collection is committed
> into Git; the overall daily collection size grew from some 10MB two
> years ago to about 410MB today, over two years and about 700 commits.
> 
> each dump differs slightly in content from yesterday's, and the changes
> are scattered all over the files; it would not deduplicate too well at
> the block level.
> 
> yet the Git storage, after compaction (which takes a few minutes on a
> slow desktop), totals about 200MB, all the commits included. yep: two
> years' worth of Git history takes less storage than one daily dump.
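> 
> roughly how the packed storage wins here (a toy copy/insert delta, in
> the spirit of Git's packfile deltas but not its actual format):
> yesterday's dump is the base, and today's file is encoded as big
> copies from the base plus a few small literal inserts:
> 
> 	/* toy delta: 'C' copies len bytes from base at off,
> 	 * 'I' inserts len literal bytes */
> 	#include <stdio.h>
> 	#include <string.h>
> 	#include <stdint.h>
> 
> 	typedef struct Op Op;
> 	struct Op {
> 		char kind;		/* 'C' or 'I' */
> 		uint32_t off;		/* copy: offset into base */
> 		uint32_t len;
> 		const char *lit;	/* insert: literal bytes */
> 	};
> 
> 	size_t
> 	apply(const char *base, Op *ops, int nops, char *out)
> 	{
> 		size_t n = 0;
> 		int i;
> 
> 		for (i = 0; i < nops; i++) {
> 			if (ops[i].kind == 'C')
> 				memcpy(out+n, base+ops[i].off, ops[i].len);
> 			else
> 				memcpy(out+n, ops[i].lit, ops[i].len);
> 			n += ops[i].len;
> 		}
> 		return n;
> 	}
> 
> 	int
> 	main(void)
> 	{
> 		const char *base = "INSERT INTO t VALUES (1, 'old');\n";
> 		Op ops[] = {			/* one changed value: 3 small ops */
> 			{ 'C', 0, 26, 0 },	/* "INSERT INTO t VALUES (1, " */
> 			{ 'I', 0, 5, "'new'" },
> 			{ 'C', 30, 3, 0 },	/* ");\n" */
> 		};
> 		char out[128];
> 
> 		fwrite(out, 1, apply(base, ops, 3, out), stdout);
> 		return 0;
> 	}
> 
> unlike fixed-size blocks, the copies can start at any byte offset, so
> scattered small edits cost only their own literals; the op stream then
> compresses further, which is (my guess) where the rest of the factor
> comes from.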

given that most disks are very large, and most people's non-media
storage requirements are very small, why is this compelling?

from what i've seen, people have the following requirements for storage:
1.  speed
2.  speed
3.  speed
4.  large caches.

- erik
