Date: Wed, 11 Jul 2007 01:26:18 -0400
From: "George Georgalis" <[EMAIL PROTECTED]>
The program is http://www.ka9q.net/code/dupmerge/ ; there are 200 lines of well-commented C. However, there may be a bug which allocates too much memory (one block per file), so my application runs out of memory. :\ If you (anyone) can work it out and/or bring it into rsync as a new feature, that would be great. Please keep the author and myself in the loop!

Do a search for "faster-dupemerge"; you'll find mentions of it in the dirvish archives, where I describe how I routinely use it to hardlink together filesystems in the half-terabyte-and-above range without problems, on machines that are fairly low-end these days (a gig of RAM, a gig or so of swap, very little of which actually gets used by the merge).

Dirvish uses -H in rsync to do most of the heavy lifting (a rough sketch of that kind of invocation is at the end of this message), but large movements of files from one directory to another between backups won't be caught by rsync*. So I follow dirvish runs with a run of faster-dupemerge across the last two snapshots of every machine being backed up (i.e., one single run that includes two snapshots per backed-up machine). That not only catches file movements within a single machine, but also links together backup files -across- machines, which is quite useful when you have several machines that share a lot of similar files (e.g., the files in the distribution you're running), or if a file moves from one machine to another, etc., and it saves considerable space on the backup host.

[You can also trade off speed for space: since the return on hardlinking zillions of small files is relatively low compared to a few large ones, you can specify "only handle files above 100K" or whatever (or anything else you'd like as an argument to "find") and thus considerably speed up the run while not losing much in the way of space savings; I believe I gave some typical figures in one of my posts to the dirvish lists. Also, since faster-dupemerge starts off by sorting the results of the "find" by size, you can manually abort it at any point and it will have merged the largest files first. A sketch of that sort of pipeline is also at the end of this message.]

http://www.furryterror.org/~zblaxell/dupemerge/dupemerge.html is the canonical download site, and it mentions various other approaches and their problems. (Note that workloads such as mine will also require at least a gig of space in some temporary directory that's used by the sort program; fortunately, you can specify on the command line where that temp directory will be, and it's less than 0.2% of the total storage of the filesystem being handled.)

* [Since even fuzzy-match only looks in the current directory, I believe, unless later versions can be told to look elsewhere as well and I've somehow missed that; if I -have- missed that, it'd be a nice addition to be able to specify extra directories (and/or trees) in which fuzzy-match should look, although in the limit that might require a great deal of temporary space and run slowly.]
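For readers who haven't watched dirvish drive rsync: the line below is only a hedged illustration of the usual -H / --link-dest pattern, not dirvish's actual command line; the host name and snapshot paths are made-up placeholders.

    # Hypothetical illustration only (not dirvish's real invocation):
    # -a preserves metadata, -H preserves hard links within the transfer,
    # and --link-dest hard-links files that are unchanged since the
    # previous snapshot instead of storing a second copy of them.
    rsync -aH --link-dest=/backup/hostA/2007-07-10 \
        hostA:/ /backup/hostA/2007-07-11/

That link-against-the-previous-snapshot step is why unchanged files cost almost nothing per snapshot, and why it is chiefly files that move between directories (or between machines) that slip through and need the follow-up merge pass described above.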
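Similarly, the size-threshold and sort-by-size behaviour mentioned in the bracketed aside can be pictured as something like the pipeline below. This is a simplified sketch using only standard tools, not faster-dupemerge's real interface; the 100K cutoff, the snapshot paths, and the scratch directory are all placeholder assumptions.

    # Hypothetical sketch of the described workflow, not the real
    # faster-dupemerge command line (see its own documentation for that).
    # Only files over 100K are considered, candidates are sorted
    # largest-first, and sort is pointed at a roomy scratch directory.
    find /backup/hostA/2007-07-10 /backup/hostA/2007-07-11 \
         -type f -size +100k -printf '%s\t%p\n' \
      | sort -rn -T /backup/tmp \
      | cut -f2- > /backup/tmp/candidates-by-size
    # A merge pass would then compare equal-sized candidates and replace
    # true duplicates with hard links; that comparison-and-link step is
    # the part faster-dupemerge itself implements.

Because the candidate list is sorted largest-first, killing the run partway through still leaves the biggest space savings already banked, which is the abort-at-any-point property mentioned above.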