On Tue, Feb 10, 2009 at 02:45:43PM -0800, Roman V. Shaposhnik wrote:
> On Tue, 2009-02-10 at 17:28 -0500, erik quanstrom wrote:
> > what leads you to believe that that amount of sharing will be
> > significant?
> 
> Just a hunch so far. I don't have hard data to prove anything.
> On the other hand, I'd be surprised if massive updates (not pulling
> in a couple of months) didn't benefit from the sharing.
> 
> Thanks,
> Roman.

I have mirrored, with vac -f, every sources dump from 2002 to
yesterday with 
      -e acme/acid/386 -e acme/acid/alpha -e acme/acid/arm \
      -e acme/acid/mips -e acme/acid/power -e acme/bin/386 \
      -e acme/bin/alpha -e acme/bin/arm -e acme/bin/mips \
      -e acme/bin/power -e acme/mail/386 -e acme/mail/alpha \
      -e acme/mail/arm -e acme/mail/mips -e acme/mail/power \
      -e sys/man/vol1.ps -e sys/man/vol1.ps.gz -e sys/man/vol1.pdf \
      LICENSE* NOTICE acme lib rc sys ;
intending to get all the source and not the binaries.  I patched my vac to
ignore atimes (replacing the atime field in the vac metadata with the mtime)
to increase metadata block sharing.  As of 2009/0205 (a convenient snapshot
to du), this
represents about 140.7 MB of data per dump.  The entire copy takes 550 MB
(240 MB actual storage in Venti).  (With no sharing whatsoever, this would
be approx. 310 GB.)  I would like to re-archive this with the Rabin
fingerprinting vac for comparison.
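
As a rough sanity check on the no-sharing figure, here is a minimal sketch of
the arithmetic.  It assumes one dump per day and borrows the 2002/1212 to
2009/0205 range quoted for the git crawls; neither assumption is confirmed for
the vac mirror, and the names in the code are purely illustrative.

```python
from datetime import date

# Assumption: one dump per calendar day, first and last dates inclusive.
# The range is borrowed from the git crawl dates, not from the vac mirror.
FIRST_DUMP = date(2002, 12, 12)
LAST_DUMP = date(2009, 2, 5)

PER_DUMP_MB = 140.7   # du of the 2009/0205 snapshot, from the message
ARCHIVE_MB = 550.0    # size of the entire copy, from the message
VENTI_MB = 240.0      # actual storage in Venti, from the message


def dump_count(first: date, last: date) -> int:
    """Number of daily dumps between two dates, both ends included."""
    return (last - first).days + 1


def unshared_mb(dumps: int, per_dump_mb: float) -> float:
    """Total size if every dump were stored whole, with no sharing.

    Treats every dump as being the size of the latest one, which
    overestimates if the tree grew over time.
    """
    return dumps * per_dump_mb


if __name__ == "__main__":
    dumps = dump_count(FIRST_DUMP, LAST_DUMP)
    naive = unshared_mb(dumps, PER_DUMP_MB)
    print(f"dumps: {dumps}")
    print(f"no sharing: {naive / 1000:.0f} GB")
    print(f"vs. entire copy: {naive / ARCHIVE_MB:.0f}x")
    print(f"vs. Venti storage: {naive / VENTI_MB:.0f}x")
```

Under those assumptions the result lands in the same range as the approx.
310 GB quoted above; the exact figure depends on whether MB means 10^6 or
2^20 bytes.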

(In case anybody wants to rush out and recreate the results, it took
roughly 10 to 15 minutes per dump to dispatch all the Tstat requests to
sources.)

Incidentally, a git repository of the crawls, from 2002/1212 to 2009/0205,
is available at http://mirrors.acm.jhu.edu/trees/plan9native/ .  Git gets
the data down to 165M after a gc run, so perhaps it's a better idea than a
venti-based mirror.  I haven't managed to make my version of Uriel's port
(thanks for the start! :) ) of git do the right thing in enough cases yet,
so the git repo may not be updated for a while, but I figured somebody might
want to play with it in the interim.

--nwf;
