On Saturday 14 January 2012 01:30:32 Bakul Shah wrote:
> On Sat, 14 Jan 2012 00:14:25 +0100 Francisco J Ballesteros <n...@lsub.org> wrote:
> > but if you insert extra music in front of your track dedup in venti won't
> > help. or would it?
> 
> No. Venti operates at block level.


there are two ways around it available:

0)
using a rolling checksum enables decent block-level deduplication on files 
that are modified in the middle; some info:
http://svana.org/kleptog/rgzip.html
http://blog.kodekabuki.com/post/11135148692/rsync-internals

in short, a rolling checksum is used to find reasonable restart points; for 
us, block boundaries. it could probably be overlaid on Venti; 
rollingchecksumfs anybody?
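
a minimal sketch of the idea, in Python, just to show why the boundaries 
survive an insertion. everything here is an assumption for illustration: the 
names (chunks, fixed_blocks, shared), the 16-byte window, the ~64-byte average 
chunk, and the plain byte sum used as the rolling checksum (rsync and rgzip 
use stronger ones). it has nothing to do with Venti's actual code.

```python
import hashlib

WINDOW = 16   # bytes covered by the rolling sum (assumed value)
MASK = 0x3F   # cut where the low 6 bits are zero: ~64-byte chunks on average


def chunks(data):
    """Split data at boundaries picked by a rolling sum of the last WINDOW bytes."""
    out, start, s = [], 0, 0
    for i, b in enumerate(data):
        s += b
        if i >= WINDOW:
            s -= data[i - WINDOW]      # slide the window: drop the oldest byte
        # only cut once the window is full, so a cut depends on content alone
        if i + 1 >= WINDOW and (s & MASK) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])       # trailing partial chunk
    return out


def fixed_blocks(data, size=64):
    """Split data into fixed-size blocks, the way a plain block store would."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared(a, b):
    """Fraction of the blocks in b whose SHA-1 score already appears in a."""
    seen = {hashlib.sha1(x).digest() for x in a}
    return sum(hashlib.sha1(x).digest() in seen for x in b) / len(b)
```

comparing shared(fixed_blocks(old), fixed_blocks(new)) against 
shared(chunks(old), chunks(new)), where new is old with a few bytes glued on 
the front, shows the difference: the fixed blocks all shift and stop matching, 
while the content-defined chunks fall back into step after the first boundary.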

1)
Git uses a diff-based format for long-term compacted storage, plus some zlib 
(gzip-style) compression. i don't know the specifics, but IIRC it's pretty 
much standard diff.

it's fairly CPU- and memory-intensive on larger (10...120MB in my case) text 
files, but produces a beautiful result:

i have a cron job that dumps a dozen MySQL databases; each dump is some 
10...120MB of SQL (textual). each daily dump collection is committed into Git; 
the overall daily collection size grew from some 10MB two years ago to about 
410MB today; that is about 700 commits over two years.

each dump differs slightly in content from yesterday's, and the changes are 
scattered all over the files; they would not de-duplicate well at block level.

yet the Git storage, after compaction (which takes a few minutes on a slow 
desktop), totals about 200MB, all the commits included. yep; less storage 
taken by two years' worth of Git storage than by one daily dump.

perhaps Git's current diff format would not handle binary files very well, but 
there are binary diffs available out there.
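
to make the diff-based storage idea concrete, here is a small Python sketch 
that stores a new version of a text as "copy these lines from the old version" 
plus "insert these new lines". the function names and the tuple layout are made 
up for illustration, and this is not Git's actual pack format; it only shows 
why a dump that differs by a few scattered lines costs almost nothing to store.

```python
import difflib


def make_delta(old, new):
    """Encode new (a list of lines) as copy ranges of old plus inserted lines."""
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))         # reuse old[i1:i2]
        elif j2 > j1:
            delta.append(("insert", new[j1:j2]))   # 'replace' or 'insert'
        # 'delete' needs no entry: those old lines are simply never copied
    return delta


def apply_delta(old, delta):
    """Rebuild the new version from the old lines and a delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out
```

for two dumps read with readlines(), make_delta(yesterday, today) keeps only 
the changed lines verbatim, and apply_delta(yesterday, delta) gives today's 
dump back.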


-- 
dexen deVries

> Gresham’s Law for Computing:
>   The Fast drives out the Slow even if the Fast is Wrong.

William Kahan in
http://www.cs.berkeley.edu/~wkahan/Stnfrd50.pdf
