John Van Essen wrote:

> Inasmuch as I can follow some of the simpler optimizations, I'm at a
> loss as to what is being so dramatically improved for large files.
>
> Can you write up the little piece that you would add to the NEWS file
> and describe (in layman's terms if possible) what the benefit is and
> under what conditions an improvement will be observed by the user?
> 'Cause otherwise I wouldn't know what it is that is being tested.
>
> Thanks.
>
>   John

Being the originator of this patch, I'll pick it up:
Rsync uses intermediate hashes in order to find blocks in the destination
that match existing blocks in the source. This is the heart of how rsync
manages to speed up transfers.

Now here's the confusing part. These hashes are stored in a hash table
(which hashes the hashes). A hash table is a data structure that more or
less guarantees constant-time access to any member, assuming a certain
load threshold (called alpha) is not passed.

Up until now, this hash table had a fixed size of 65536 buckets. Assuming
you did not mess with the block size, the alpha threshold was passed for
files from about 2.5GB up, and rsync began spending time searching
through its own data structure for the correct hash instead of simply
finding it. The bigger the file, the more time spent on this unnecessary
activity. As people (Isk, in this case) are talking about transferring
over 500GB of files, a potentially large amount of CPU time is wasted
on this.

The new patch uses a dynamic number of buckets that depends on the number
of hashed blocks in the file. We thus hope to eliminate this bottleneck
completely.

          Shachar
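P.S. To make the effect concrete, here is a minimal sketch in C. It is
not rsync's actual code: every name in it (struct entry, build_table,
choose_nbuckets, ...), the block count, and the synthetic checksums are
made up for illustration. It builds the per-block checksum table twice,
once with the old fixed 65536 buckets and once with a bucket count sized
to the number of blocks, and measures how many chain entries an average
lookup has to walk.

/*
 * Toy demo of fixed vs. dynamic bucket counts in a chained hash table
 * mapping weak block checksums to block indices.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct entry {
    uint32_t sum;           /* weak checksum of one block */
    uint32_t block;         /* index of that block in the file */
    struct entry *next;     /* collision chain within a bucket */
};

struct table {
    struct entry **buckets;
    uint32_t nbuckets;      /* always a power of two here */
};

/* Old behaviour: 65536 buckets, no matter what.  New behaviour: grow
 * the bucket count with the block count so the load factor (alpha)
 * stays around 1 even for very large files. */
static uint32_t choose_nbuckets(uint32_t nblocks, int dynamic)
{
    uint32_t n = 65536;
    if (dynamic)
        while (n < nblocks)     /* next power of two >= nblocks */
            n <<= 1;
    return n;
}

static struct table *build_table(const uint32_t *sums, uint32_t nblocks,
                                 int dynamic)
{
    struct table *t = malloc(sizeof *t);
    t->nbuckets = choose_nbuckets(nblocks, dynamic);
    t->buckets = calloc(t->nbuckets, sizeof *t->buckets);
    for (uint32_t i = 0; i < nblocks; i++) {
        struct entry *e = malloc(sizeof *e);
        e->sum = sums[i];
        e->block = i;
        /* masking works because nbuckets is a power of two */
        uint32_t b = sums[i] & (t->nbuckets - 1);
        e->next = t->buckets[b];
        t->buckets[b] = e;
    }
    return t;
}

/* How many entries one lookup examines -- the work done for each byte
 * offset rsync tests while scanning the source file. */
static long lookup_cost(const struct table *t, uint32_t sum)
{
    long cost = 0;
    for (struct entry *e = t->buckets[sum & (t->nbuckets - 1)]; e; e = e->next)
        cost++;
    return cost;
}

int main(void)
{
    /* roughly the block count of a 700MB file at 700-byte blocks */
    uint32_t nblocks = 1000000;
    uint32_t *sums = malloc(nblocks * sizeof *sums);
    for (uint32_t i = 0; i < nblocks; i++)
        sums[i] = i * 2654435761u;  /* well-spread synthetic checksums */

    for (int dynamic = 0; dynamic <= 1; dynamic++) {
        struct table *t = build_table(sums, nblocks, dynamic);
        long total = 0;
        for (uint32_t i = 0; i < nblocks; i++)
            total += lookup_cost(t, sums[i]);
        printf("%s buckets: %8u   avg probes per lookup: %.2f\n",
               dynamic ? "dynamic" : "fixed  ",
               (unsigned)t->nbuckets, (double)total / nblocks);
        /* (table memory deliberately leaked; it's a throwaway demo) */
    }
    free(sums);
    return 0;
}

With these idealized checksums, the fixed table should come out at
around eight probes per lookup at a million blocks (and that number
keeps growing with the file), while the dynamically sized table stays
at about one.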