On Saturday 14 January 2012 01:30:32 Bakul Shah wrote:
> On Sat, 14 Jan 2012 00:14:25 +0100 Francisco J Ballesteros <n...@lsub.org> wrote:
> > but if you insert extra music in front of your track dedup in venti won't
> > help. or would it?
>
> No. Venti operates at block level.

there are two ways around it available:

0) use of a rolling checksum enables decent block-level deduplication on
files that are modified in the middle; some info:
http://svana.org/kleptog/rgzip.html
http://blog.kodekabuki.com/post/11135148692/rsync-internals

in short, a rolling checksum is used to find reasonable restart points; for
us, block boundaries. it could probably be overlaid on Venti;
rollingchecksumfs anybody?

1) Git uses a diff-based format for long-term compacted storage, plus some
gzip compression. i don't know specifics, but IIRC it's pretty much standard
diff. it's fairly CPU- and memory-intensive on larger (10...120MB in my case)
text files, but produces beautiful results:

i have a cronjob take a dump of a dozen MySQL databases; each some 10...120MB
of SQL (textual). each daily dump collection is committed into Git; the
overall daily collection size grew from some 10MB two years ago to about
410MB today; over two years about 700 commits.

each dump differs slightly in content from yesterday's and the changes are
scattered all over the files; it would not de-duplicate too well at block
level.

yet the Git storage, after compaction (which takes a few minutes on a slow
desktop), totals about 200MB, all the commits included. yep; less storage
taken by two years' worth of Git storage than by one daily dump.

perhaps Git's current diff format would not handle binary files very well,
but there are binary diffs available out there.

-- 
dexen deVries

> Gresham’s Law for Computing:
> The Fast drives out the Slow even if the Fast is Wrong.
William Kahan in
http://www.cs.berkeley.edu/~wkahan/Stnfrd50.pdf
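
a minimal sketch of option 0), to show why a rolling checksum helps when extra
music is inserted in front of a track. a block ends wherever the checksum of
the last few bytes hits a chosen bit pattern, so boundaries follow the content
and not the file offset. everything here is an assumption for illustration: the
names (chunks, fixed_blocks, block_ids), the window and mask sizes, and the
plain polynomial hash, which is not the checksum rsync or rgzip actually use.
it also says nothing about how Venti would store the resulting blocks.

```python
import hashlib
import random

WINDOW = 16          # bytes covered by the rolling checksum (assumed value)
MASK = (1 << 6) - 1  # cut when the low 6 bits are zero: ~64-byte blocks on average
BASE = 257
MOD = (1 << 31) - 1
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window


def chunks(data: bytes):
    """Split data at content-defined boundaries found by a rolling hash."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # take in the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:     # full window, pattern hit
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])                    # trailing partial block
    return out


def fixed_blocks(data: bytes, size: int = 64):
    """Split data at fixed offsets, as a plain block store would."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def block_ids(blocks):
    """Content addresses (SHA-1 here) of a list of blocks."""
    return {hashlib.sha1(b).hexdigest() for b in blocks}


# a pseudo-random "track", and the same track with extra bytes in front
rng = random.Random(0)
track = bytes(rng.randrange(256) for _ in range(20000))
shifted = b"extra music" + track

a, b = block_ids(chunks(track)), block_ids(chunks(shifted))
shared = len(a & b)
fixed_shared = len(block_ids(fixed_blocks(track)) & block_ids(fixed_blocks(shifted)))

print(f"rolling checksum: {shared} of {len(a)} blocks shared")
print(f"fixed offsets:    {fixed_shared} blocks shared")
```

with fixed offsets every block after the insertion shifts and gets a new
address; with content-defined boundaries only the block touching the inserted
bytes changes. a real implementation would also want minimum and maximum block
sizes, which this sketch leaves out.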