2000-11-03-13:02:22 Andy Small:
> 2000-11-02-17:56:56 Bennett Todd:
> > There's been discussion about doing away with this, letting
> > rsync work through something like a depth-first sorted-order
> > traversal or some such, both ends could do that in synch, with
> > memory requirements often down to O(logN) from O(N), but as far
> > as I know it hasn't progressed beyond talk; presumably, nobody
> > with the expertise and tuits to try to tackle the coding is
> > severely inconvenienced by the performance consequences of the
> > current implementation.
>
> You are describing RDIST.
Nope, I'm describing a refinement to the way rsync proceeds. Rather
than completely precomputing the entire file list in memory, if both
sides can perform synchronized depth-first scans (by simply sharing
a sort order applied to each directory as the search reaches it)
they can avoid having to hold more than O(logN) file data structures
in memory at once, and can commence the actual file sync almost
immediately, rather than waiting until the entire tree has been
scanned (and a structure built representing it in memory) before
beginning the sync.
> The problem with RDIST is that it doesn't react well to situations
> with (a) high latency, and/or (b) thousands of files to synch.
rsync isn't all that great with sufficiently huge numbers of files
to sync, but only due to the startup scan; once that's done it
hustles. rsync doesn't share rdist's problem with high latency or
its awful performance hit with huge numbers of files, because rsync
streams the information back and forth. Where rdist basically uses
an RPC-style synchronous handshake for each file, rsync multiplexes
several asynchronous data flows back and forth. Very roughly, after
an initial handshake (propagated in the cmdline when done over rsh)
to negotiate what job to do --- what directory hierarchy is being
synchronized --- the server builds a compact description of the
files it has, and sends that streaming over to the receiving end;
the receiver notes which files are completely missing, which files
are identical, and which files need (due to size or timestamp
mismatch, or checksum mismatch with --checksum) to be
resynchronized. It then streams back to the sender a list of files
it needs, and for those files it has possibly-older copies of, it
also streams back block checksums. The sender then walks through its
src tree; where the files are missing on the receiving end they are
simply sent; where files are present on the receiving end the magic
"rsync algorithm" fires up, and the sender finds any blocks (by
checksum) which the receiver already has, and sends instructions for
building the new file in the form of chunks of file contents
together with the offsets of blocks of the old dst file that can be
re-used in assembling the new dst file. This is also streamed to the
dst, where the receiver follows these instructions to assemble the
new dst files.
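The block-matching step above can be sketched in Python. This is a deliberately simplified illustration, not rsync's actual implementation: it uses md5 over fixed block boundaries only, whereas real rsync slides a cheap rolling checksum byte-by-byte (backed by a stronger checksum) and so finds matches at arbitrary offsets.

```python
import hashlib

BLOCK = 4  # tiny block size for illustration; real rsync uses much larger blocks

def block_sums(old_data):
    """Receiver side: checksum each fixed-size block of its existing copy."""
    return {
        hashlib.md5(old_data[i:i + BLOCK]).digest(): i
        for i in range(0, len(old_data), BLOCK)
    }

def delta(new_data, sums):
    """Sender side (simplified): emit ('copy', old_offset) for blocks the
    receiver already has, and ('literal', bytes) for everything else.
    Checking only at the current position, one byte at a time on a miss,
    stands in for the real rolling-checksum search."""
    out, i = [], 0
    while i < len(new_data):
        blk = new_data[i:i + BLOCK]
        off = sums.get(hashlib.md5(blk).digest())
        if off is not None and len(blk) == BLOCK:
            out.append(("copy", off))
            i += BLOCK
        else:
            out.append(("literal", blk[:1]))
            i += 1
    return out

def patch(old_data, instructions):
    """Receiver side: assemble the new file from copies of old blocks
    plus the literal chunks streamed over by the sender."""
    parts = []
    for op, arg in instructions:
        parts.append(old_data[arg:arg + BLOCK] if op == "copy" else arg)
    return b"".join(parts)
```

The point of the exercise: only the literals and the (small) copy instructions cross the wire, and the whole exchange is a stream in each direction rather than a per-block question-and-answer.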
Nowhere in this flow are there rdist-style, RPC-ish synchronous
back-and-forths waiting for answers; instead, the traffic consists
of three streams, two (multiplexed) from the sender to the receiver
and one from the receiver back to the sender. These all enjoy the
full ability of TCP to do its sliding window thing and fill
high-latency pipes.
> Say your round-trip time is 1000 ms (not unusual for international
> links) and you have 30,000 files to synch... RDIST takes 8 hours
> (!!) to work.
Yup, that's the price for the conceptually simple but pragmatically
inefficient RPC-per-file approach to organizing the traffic. rsync
does not suffer from this problem.
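The arithmetic behind that 8 hours is just one synchronous round-trip per file, with the wire idle while each answer is awaited:

```python
# Rough cost model for an RPC-per-file protocol on a high-latency link,
# using the numbers from the example above.
rtt_seconds = 1.0       # round-trip time
files = 30_000          # files to synchronize
total_hours = files * rtt_seconds / 3600
print(f"{total_hours:.1f} hours")  # about 8.3 hours of mostly-idle link
```

A streaming protocol pays that latency roughly once, not once per file, which is why rsync's cost is dominated by bandwidth and the startup scan instead.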
> If you could drop that filelist from memory to a configurable
> place (like /tmp/rsync.filelist), and gave it a configurable
> time-out (like 10 minutes), I would be very happy!
That's been proposed. It has not yet been implemented.
-Bennett