On Thu, Apr 27, 2006 at 05:11:41PM -0400, Matt McCutchen wrote:
> I believe the sender should compute and send the block hashes.
This has come up before, and it may well be something to make available as an option, but I don't think it should be the default, for a number of reasons:

(1) It increases the amount of data flowing in the transfer direction while removing the largest part of the data currently flowing in the opposite direction. That is a bad thing if the network is the bottleneck for the transfer. The current algorithm helps to balance the network use by sending the checksum data in the opposite direction from the transfer (see the sketch in the P.S. below).

(2) It increases the disk I/O on the sending side, because the sender would need to read each file twice (assuming that cached checksums aren't available). Since rsync runs in a pipelined mode, the sender will have moved on to several future files by the time it gets back a request for data chunks from the receiving side. Hopefully the data will still be in the disk cache, but if it is not, the transfer will bog down significantly.

(3) It increases the window during which a change on the sending side results in a bogus file being sent to the receiver. The checksums are sent some amount of elapsed time before the requests for file hunks begin arriving, and if the file changes during that period, it will not be reconstructed correctly. The current algorithm has a similar problem, but it is sensitive to changes on the receiving side. For many applications (e.g. backups), the receiving side only changes when rsync updates the files, so the current algorithm has the advantage that the checksum-aging window is on the side that is less likely to be changing.

(4) Either the number of active temporary files increases, or the memory needed to remember the matching blocks increases. Either way, the disk I/O on the receiving side also increases. This happens because the receiver gets the checksum data well before the sender's data for the file starts arriving. One algorithm to handle this is to start constructing a sparse temporary file as the basis data is read, keeping the temp file open until the sender's data arrives to finish the file (see the sketch in the P.P.S. below). The bad thing about that algorithm is that it makes it hard to compute the full-file checksum without re-reading all the matched data from the temp file (it also increases the number of active temporary files, which increases the demands on free disk space). The alternative is to do no temp-file building until the sender's data starts arriving, at which point we'd need to re-read the matched data from the basis file -- this would require caching (in memory?) all the matching-block info computed for all the files that are awaiting sender-side data.

One solution to some of the above is to remove the pipelining from the transfer so that rsync operates on one file at a time, adding delays to the transfer as each file is requested in sequence.

The main benefit of a reversed algorithm is that the sender does not need as much CPU, especially if the checksum data is cached. It seems to me that the extra disk I/O needed on both sides might outweigh this advantage, especially for large files that cannot remain in the disk cache long enough for the necessary re-reading.

..wayne..
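
P.S. For anyone who wants to picture what flows in the "opposite direction" in (1): under the current algorithm it is the receiver that reads its basis file and sends per-block checksums back to the sender. Here's a minimal, purely illustrative Python sketch of that generation step. The simplified weak sum, the 700-byte block size, and the use of MD5 are assumptions for readability, not rsync's actual code (rsync uses a rolling weak checksum plus a truncated strong sum):

    import hashlib

    def weak_sum(block):
        # Simplified stand-in for rsync's rolling weak checksum;
        # only its role in the protocol matters for this sketch.
        a = sum(block) & 0xFFFF
        b = sum((len(block) - i) * byte
                for i, byte in enumerate(block)) & 0xFFFF
        return (b << 16) | a

    def basis_checksums(path, block_size=700):
        # Yield a (weak, strong) pair for each block of the basis
        # file; this is the data the receiver streams to the sender.
        with open(path, 'rb') as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                yield weak_sum(block), hashlib.md5(block).digest()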
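
P.P.S. And a rough sketch of the sparse-temp-file idea from (4), again in illustrative Python. The function name prefill_matches and the (target_offset, basis_offset) match-list shape are hypothetical, as is the fixed block size; the point is only to show why the temp file must linger and why the full-file checksum gets expensive:

    def prefill_matches(basis_path, tmp_path, matches, block_size):
        # matches: iterable of (target_offset, basis_offset) pairs
        # for blocks of the new file found in the basis file.
        tmp = open(tmp_path, 'wb')
        with open(basis_path, 'rb') as basis:
            for target_off, basis_off in matches:
                basis.seek(basis_off)
                tmp.seek(target_off)  # seek past EOF leaves a hole
                tmp.write(basis.read(block_size))
        # tmp stays open, awaiting the sender's literal data to plug
        # the holes -- and finishing a whole-file checksum now means
        # re-reading the matched data from this temp file, which is
        # the extra receiving-side I/O described above.
        return tmp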