Hi everybody,

I have implemented an idea for reducing the download load for everybody who mirrors a lot of data using rsync, such as the people who mirror Debian GNU/Linux. Currently, many Debian "leaf mirrors" use rsync to mirror from the main .debian.org hosts.

rsync contains a wonderful algorithm to speed up downloads when mirroring files that have only minor differences. The only problem is that this algorithm is ALMOST NEVER used when mirroring a Debian repository: whenever a new version of a package enters the Debian repository, the package file has a different name, so rsync just does a full download. In summary, rsync currently gives a speedup only when it downloads Packages.gz files, or when it skips a package that already exists locally.

Well, I have just implemented a simple way to use the algorithm even when downloading the .debs.

Here is a simple example. Suppose the current situation is

  $REMOTE::/pub/debian/dist/bin/dpkg_2.deb

whereas locally we have

  /debian/dist/bin/dpkg_1.deb

When rsync looks for a local version of /debian/dist/bin/dpkg_2.deb and there is none, then rsync does

  ls -t /debian/dist/bin/dpkg_*

and picks the most recent file it finds. This way, rsync will use the file /debian/dist/bin/dpkg_1.deb to try to speed up the download of $REMOTE::/pub/debian/dist/bin/dpkg_2.deb (using its fabulous algorithm).

BIG PRO: my new "rsync" is totally compatible with the old one.

Conclusion: this idea would make all Debian mirror people happier, especially if they mirror "unstable". Consider that often, when a new version of a package is released, only small changes are made; sometimes only the .postinst or similar files really change. This may, though, be masked by the compression, alas (but see the TODO).

I attach two files. The first is a diff showing where I modified the "rsync 2.4.1" source tree. The second is a .tgz of all the new and modified files you need to build the new rsync. To build, first download the source code (see rsync.samba.org/rsync/download.html), then unpack the file rsync.diffsrc.tgz in the source tree, and build.

You may also get the compiled binary directly as ftp://tonelli.sns.it/pub/rsync/rsync and the new code altogether in ftp://tonelli.sns.it/pub/rsync

TODO: there are some potentially good ideas here:

1) The idea is to add "modules" to rsync: a "gzip" module, a "deb" module, an "rpm" module, and so on. Currently, modules just look for an older local version of the file. In a future version, each module would apply to a certain type of file and create another file to pass to "rsync", so that this other file would probably lead to more speedup. E.g., the "gzip" module would unzip files before doing comparisons, and the "deb" module would unzip the data.tar.gz part of a package.

   CONS: this would not be backward compatible, of course.

   The idea is that a module may provide the following calls:

     find_alternative_version_MOD()
     receive_file_MOD()
     send_file_MOD()

   Currently, only find_alternative_version_deb() is implemented. If rsync uses only the find_alternative_version_MOD() calls, then it is "backward compatible" with the usual version (in a sense, it is doing what the option --compare-dest already does, only in a smarter way). I have not yet implemented any receive_file_MOD() or send_file_MOD(): these would need a change in the protocol. I hope that the rsync authors will give permission.

1b) My idea (I am not sure about it) is that "rsync" may work if given "named pipes" instead of files. Indeed, according to the technical report, it needs to read the local and remote files only once, and then it writes the local file without ever seeking backwards. The above modules would then not need to use disk space and create temporary files.

2) For faster apt-get downloading, it may be possible to do the same trick WHEN UPGRADING INSTALLED PACKAGES! Here is the idea: apt-get creates a local version of the package (using dpkg-repack) and runs rsync to get the remote version.

-- 
Andrea C. Mennucci, Scuola Normale Superiore, Pisa, Italy

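The three module calls named in the TODO could be grouped into a table of function pointers, one entry per file type. The following is only a sketch under stated assumptions: the struct, its field names, the signatures of receive_file and send_file, and the helper names module_for() and module_needs_protocol_change() are all hypothetical and are not taken from modules/modules.h.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical module descriptor. Field names and signatures are
 * illustrative assumptions, not the contents of modules/modules.h. */
struct module_sketch {
    const char *suffix;                               /* file type handled, e.g. ".deb" */
    char *(*find_alternative_version)(const char *);  /* the only call implemented so far */
    int (*receive_file)(int fd_in, int fd_out);       /* future: would need a protocol change */
    int (*send_file)(int fd_in, int fd_out);          /* future: would need a protocol change */
};

/* Placeholder standing in for find_alternative_version_deb(); it finds nothing. */
static char *find_alternative_version_deb_placeholder(const char *fname)
{
    (void)fname;
    return NULL;
}

/* Table of known modules; only a "deb" entry, with no receive/send hooks. */
static const struct module_sketch modules_sketch[] = {
    { ".deb", find_alternative_version_deb_placeholder, NULL, NULL },
};

/* Return the module whose suffix matches the end of fname, or NULL. */
const struct module_sketch *module_for(const char *fname)
{
    size_t flen = strlen(fname);
    size_t i;

    for (i = 0; i < sizeof modules_sketch / sizeof modules_sketch[0]; i++) {
        size_t slen = strlen(modules_sketch[i].suffix);
        if (flen >= slen &&
            strcmp(fname + flen - slen, modules_sketch[i].suffix) == 0)
            return &modules_sketch[i];
    }
    return NULL;
}

/* A module that only looks up an older local file keeps the wire protocol
 * unchanged; one that provides receive/send hooks does not. */
int module_needs_protocol_change(const struct module_sketch *m)
{
    return m->receive_file != NULL || m->send_file != NULL;
}
```

With a table like this, a module that fills in only find_alternative_version stays backward compatible in the sense described in the TODO, while any entry that sets receive_file or send_file marks itself as requiring the protocol change.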
? modules
? zlib/dummy
Index: Makefile.in
===================================================================
RCS file: /cvsroot/rsync/Makefile.in,v
retrieving revision 1.39
diff -r1.39 Makefile.in
24c24
<         lib/fnmatch.h lib/getopt.h lib/mdfour.h
---
>         lib/fnmatch.h lib/getopt.h lib/mdfour.h modules/modules.h
32c32,33
< OBJS=$(OBJS1) $(OBJS2) $(DAEMON_OBJ) $(LIBOBJ) $(ZLIBOBJ)
---
> MODULES_OBJ = modules/modules.o modules/deb.o
> OBJS=$(OBJS1) $(OBJS2) $(DAEMON_OBJ) $(LIBOBJ) $(ZLIBOBJ) $(MODULES_OBJ)
Index: generator.c
===================================================================
RCS file: /cvsroot/rsync/generator.c,v
retrieving revision 1.16
diff -r1.16 generator.c
19a20,23
> #ifndef NODEBIANVERSIONER
> #include "modules/modules.h"
> #endif
>
311c315,349
<                 fnamecmp = fnamecmpbuf;
---
>         {
>                 fnamecmp = fnamecmpbuf;
>                 if (verbose > 1)
>                         rprintf(FINFO,"recv_generator opens %s\n",fnamecmp);
>         }
>     }
> #ifndef NODEBIANVERSIONER
>     /* by A Mennucci. GPL
>        this piece will look for a previous version
>        of the same file
>        I think that rsync is somewhat a "spaghetti code":
>        look at how many extern declarations it uses....
>        and it is crazy that this check has to be done in two separate places
>     */
>     if (statret == -1) {
>         char *nf;
>         int saveerrno = errno;
>         nf=find_alternative_version(fname);
>         if ( nf != NULL)
>         {
>             statret = link_stat(nf,&st);
>             if (!S_ISREG(st.st_mode))
>                 statret = -1;
>             if (statret == -1)
>             {
>                 perror("stat of suggested older version failed:");
>                 errno = saveerrno;
>             }
>             else
>             {
>                 fnamecmp = fnamecmpbuf;
>                 strcpy(fnamecmp, nf);
>             }
>             free (nf);
>         }
312a351
> #endif
Index: receiver.c
===================================================================
RCS file: /cvsroot/rsync/receiver.c,v
retrieving revision 1.28
diff -r1.28 receiver.c
18a19,21
> #ifndef NODEBIANVERSIONER
> #include "modules/modules.h"
> #endif
21a25
>
375a380,401
> #ifndef NODEBIANVERSIONER
>     /* by A Mennucci.
>        this piece will look for a previous version
>        of the same file */
>     if ((fd1 == -1)) {
>         char *nf;
>         nf=find_alternative_version(fname);
>         if (nf!= NULL)
>         {
>             fnamecmp = fnamecmpbuf;
>             strcpy(fnamecmpbuf,nf);
>             fd1 = do_open(nf, O_RDONLY, 0);
>             if(fd1==-1)
>                 perror("file candidate");
>             free(nf);
>         }
>     }
>     if (fd1 != -1 )
>         rprintf(FINFO,
>                 "((candidate local oldfile for %s is %s))\n",
>                 fname,fnamecmp);
> #endif
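
The diff above calls find_alternative_version(), whose implementation is in the attached archive rather than in the diff itself. For readers without the archive, here is a minimal sketch of what such a lookup could look like, following the dpkg_1.deb / dpkg_2.deb example. It is an assumption, not the actual code: the name find_alternative_version_sketch, the matching rule (same directory, same name up to the first underscore, ending in .deb) and the use of st_mtime to pick the newest candidate are all illustrative.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical sketch, NOT the code shipped in rsync.diffsrc.tgz.
 * Given ".../dpkg_2.deb", return a malloc()ed path to the most recently
 * modified regular file ".../dpkg_*.deb" other than fname itself,
 * or NULL if there is none. The caller must free() the result. */
char *find_alternative_version_sketch(const char *fname)
{
    const char *slash = strrchr(fname, '/');
    const char *base = slash ? slash + 1 : fname;
    const char *us = strchr(base, '_');
    size_t len = strlen(base);
    size_t prefix_len, dir_len;
    char *dir, *best = NULL;
    time_t best_mtime = 0;
    DIR *d;
    struct dirent *e;

    /* Only handle names of the form <package>_<version>.deb */
    if (us == NULL || len < 4 || strcmp(base + len - 4, ".deb") != 0)
        return NULL;
    prefix_len = (size_t)(us - base) + 1;      /* e.g. "dpkg_" */

    /* Directory part of fname: "." if there is no slash, "/" for the root. */
    dir_len = slash ? (size_t)(slash - fname) : 0;
    dir = malloc(dir_len + 2);
    if (dir == NULL)
        return NULL;
    if (slash == NULL) {
        strcpy(dir, ".");
    } else if (dir_len == 0) {
        strcpy(dir, "/");
    } else {
        memcpy(dir, fname, dir_len);
        dir[dir_len] = '\0';
    }

    d = opendir(dir);
    while (d != NULL && (e = readdir(d)) != NULL) {
        size_t nlen = strlen(e->d_name);
        struct stat st;
        char *path;

        if (strncmp(e->d_name, base, prefix_len) != 0)
            continue;                          /* different package */
        if (nlen < 4 || strcmp(e->d_name + nlen - 4, ".deb") != 0)
            continue;                          /* not a .deb */
        if (strcmp(e->d_name, base) == 0)
            continue;                          /* the file itself */

        path = malloc(strlen(dir) + 1 + nlen + 1);
        if (path == NULL)
            break;
        sprintf(path, "%s/%s", dir, e->d_name);

        /* Keep the newest regular file seen so far. */
        if (stat(path, &st) == 0 && S_ISREG(st.st_mode) &&
            (best == NULL || st.st_mtime > best_mtime)) {
            free(best);
            best = path;
            best_mtime = st.st_mtime;
        } else {
            free(path);
        }
    }
    if (d != NULL)
        closedir(d);
    free(dir);
    return best;
}
```

As in the patch, the caller owns the returned string and must free() it; a NULL result simply means rsync falls back to a full download.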
rsync.diffsrc.tgz
Description: GNU Unix tar archive