Cameron Simpson [[EMAIL PROTECTED]] writes:

>|     Cameron> The other day I was moving a lot of data from one spot to
>|     Cameron> another.  About 12G in several 2G files. [...]
>|     Cameron> so I used rsync so that its checksumming could speed past
>|     Cameron> the partially copied file. It spent a long time
>|     Cameron> transferring nothing and ran out of memory. From the
>|     Cameron> error I'm inferring that it checksums the entire source
>|     Cameron> file before sending anything across the link.
>| I know I'm not (directly) addressing the problem, and I don't know the
>| code, but will specifying a larger block size allow you to work-around
>| the problem?
>
>Perhaps - the transfer is done now but I'll try it next time I have
>such an issue. I was more concerned with the appearance that rsync
>stashes all the checksums before sending any. This seemed memory hungry
>and nonstreaming, which is odd in an app so devoted to efficiency.

Your observation about the behavior is accurate, though - while rsync
spends a lot of energy making the transfer of the file data itself
efficient, the meta-process that determines what to transfer is fairly
synchronous.

After exchanging information to determine the set of files involved,
the receiver proceeds through each file in turn, computes the
checksums for that file and then transmits them.  The sender receives
all the checksums, uses them in conjunction with its own copy of the
file to compute the delta information, and then transmits that back.
As the receiver receives the delta information it recreates the new
file.
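To make that flow concrete, here's a toy sketch in Python (rsync
itself is C, and none of these names, formats, or checksums are
rsync's own - the "weak" sum in particular is just a stand-in for the
real rolling checksum):

    import hashlib

    BLOCK = 4   # absurdly small block size so the example is easy to trace

    def receiver_checksums(old):
        """Step 1: the receiver checksums its existing copy, block by block."""
        sums = []
        for i in range(0, len(old), BLOCK):
            block = old[i:i+BLOCK]
            sums.append((sum(block) & 0xffff,           # toy "weak" checksum
                         hashlib.md5(block).digest()))  # strong checksum
        return sums

    def sender_delta(new, sums):
        """Step 2: the sender matches its copy against the receiver's
        checksums, emitting ('copy', block_index) and ('data', bytes)."""
        weak = {w: (j, s) for j, (w, s) in enumerate(sums)}
        delta, literal, i = [], bytearray(), 0
        while i < len(new):
            block = new[i:i+BLOCK]
            hit = weak.get(sum(block) & 0xffff)
            if hit and hashlib.md5(block).digest() == hit[1]:
                if literal:
                    delta.append(('data', bytes(literal)))
                    literal = bytearray()
                delta.append(('copy', hit[0]))
                i += BLOCK
            else:
                literal.append(new[i])   # real rsync *rolls* the weak sum here
                i += 1
        if literal:
            delta.append(('data', bytes(literal)))
        return delta

    def receiver_rebuild(old, delta):
        """Step 3: the receiver replays the delta against its old copy."""
        out = bytearray()
        for kind, val in delta:
            out += old[val*BLOCK:(val+1)*BLOCK] if kind == 'copy' else val
        return bytes(out)

    old = b"the quick brown fox"
    new = b"the quick red fox"
    assert receiver_rebuild(old, sender_delta(new, receiver_checksums(old))) == new

Note that sender_delta() takes the complete sums list before it can
start - which is exactly the synchronous structure described above.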

So there is definitely start-up overhead that must occur before any of
the file data is transferred at all, and for a very large file, the
checksum computation and the transmission of the checksum information
can be lengthy.

Some of this is unavoidable - until the sender has all of the
receiver's checksum information, it can't safely start sending.  Some
of the very end of the receiver's current file may be needed at the
very beginning of the sender's new version, and the sender can't
detect that until it knows about the receiver's entire file.
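You can see the worst case directly with the toy sketch above, by
rotating the old file so its last block comes first:

    old = b"AAAABBBBCCCC"
    new = b"CCCCAAAABBBB"   # receiver's final block now leads the new file
    # sender_delta(new, receiver_checksums(old)) ==
    #     [('copy', 2), ('copy', 0), ('copy', 1)]
    # The very first instruction refers to the receiver's *last* block,
    # so the sender couldn't have emitted anything until that final
    # checksum had arrived.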

Adjusting the blocksize manually can have an impact on this.  The
larger the blocksize, the smaller the checksum meta-information, since
that meta-information grows linearly with the number of blocks in the
file.  If a block size is not set on the command line, rsync will do
some dynamic adjustment of the blocksize (roughly size/10000), maxing
out at 16K.  During transmission it's 6 bytes per block, but I believe
it's 32 bytes in memory.  So for a 2GB file, you'll have about 122,000
blocks, so ~700K transmitted and ~4MB in memory.  That doesn't really
sound like enough to exhaust memory on typical machines nowadays,
though.  There's some per-file growth too, but the per-block checksums
are freed as rsync works through each file.
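To check that arithmetic (the 6- and 32-byte per-block costs are my
estimates above, not figures verified against the source):

    file_size = 2 * 10**9                        # one of the ~2GB files
    block = min(file_size // 10000, 16 * 1024)   # dynamic sizing, capped at 16K
    blocks = -(-file_size // block)              # ceiling division -> 122071
    print(6 * blocks / 1024.)      # ~715K transmitted
    print(32 * blocks / 2.0**20)   # ~3.7MB held in memory

And since both figures scale with the block count, forcing a larger
block with rsync's -B/--block-size option (as suggested earlier in the
thread) shrinks them proportionally - e.g. -B 65536 cuts both by a
factor of four.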

Now, in terms of increased efficiency - while you do have to transmit
all of the checksum information before the sender can compute the
delta, one thing I've been interested in trying is having the receiver
send the checksums as it computes them.  I'm not entirely sure why
they have to be saved in memory, since they're freed right after
transmission.  About the only risk I see is that it couples the
checksum process to the line speed, which could raise the risk of
inconsistency if the file on the receiver is changing - but that risk
is already there, just with a smaller window.  I haven't had a chance
to try the change yet, though.
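In terms of the toy sketch above, the change amounts to something like
this (hypothetical, and glossing over rsync's actual protocol
framing):

    import hashlib

    def receiver_checksums_streaming(old, send, BLOCK=4):
        """Like receiver_checksums() above, but each block's checksum is
        handed to the wire (the send callback) the moment it's computed,
        so nothing accumulates in memory."""
        for i in range(0, len(old), BLOCK):
            block = old[i:i+BLOCK]
            send((sum(block) & 0xffff, hashlib.md5(block).digest()))
            # O(1) memory per file - the trade-off is that checksum
            # speed is now coupled to the line speed, as noted above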

-- David

/-----------------------------------------------------------------------\
 \               David Bolen            \   E-mail: [EMAIL PROTECTED]  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
\-----------------------------------------------------------------------/
