The cause is, of course, that the tree being synchronized is getting larger, so naturally rsync is slowing down. But for my particular file tree there is a way it could be sped up, although it would need a change in the rsync protocol to accomplish it. Any tree with large unchanged sub-branches would benefit from this.
The file tree I'm synchronizing in this case holds archived data that is deposited under a YYYY/MM/DD directory structure. Hundreds to thousands of files are added each day, and I'm even considering breaking it down further by hour. In theory, I could do the synchronizing by date. The receiving side, which is also the initiating side, could record the last date on which it fully completed synchronizing, and then re-run that date plus any subsequent dates, rather than the entire tree (which could easily reach a quarter million files or more per year).
But there is one catch. Occasionally an older file is updated, and that older file needs to be synchronized as well. The receiving/initiating side has no way to know whether an old file has been updated.
What I think would improve rsync speed in this scenario, and in similar ones where many tree branches go unchanged for extended periods, is to collect the timestamps (and checksums, if that is enabled) for each entire branch, hash them, and transfer only the hash of that metadata. It would need to be a strong hash like MD5 or SHA1, since if the hashes are equal the tree branch would be skipped and none of the filenames within it would be transferred. For branches where the hashes are not equal, the same hashing would be applied recursively to the sub-branches until either unchanged sub-branches are found and skipped, or changed files are found (and transferred).
The drawback of this mechanism is that nothing would be exchanged between the rsync processes until the entire tree had been scanned and all the timestamps collected (worse if doing checksums). On networks that are slow relative to the total volume of data to be synchronized, though, this could still be a major savings in both time and traffic. But it clearly would need a special option to enable it.
Has anything like this been considered before?
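
The recursive branch hashing described above could be sketched roughly as follows in Python. This is only an illustration of the idea, not rsync code and not a proposed wire format: the names `scan` and `dirs_to_compare` are made up here, SHA-1 is just one of the strong hashes mentioned, and the sketch assumes the metadata being hashed is file name, size, and modification time in whole seconds (checksums are left out).

```python
import hashlib
import os


def scan(root):
    """Walk root once; return {relative_dir: (digest, [subdirectory names])}.

    A directory's digest covers the name, size and mtime of each file in it,
    plus the digests of its subdirectories, so a change anywhere below a
    branch changes that branch's digest.
    """
    table = {}

    def visit(abs_path, rel_path):
        h = hashlib.sha1()
        subdirs = []
        # Sort entries so both sides hash them in the same order.
        for entry in sorted(os.scandir(abs_path), key=lambda e: e.name):
            name = os.fsencode(entry.name)
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.name)
                child = visit(entry.path, os.path.join(rel_path, entry.name))
                h.update(b"D\0" + name + b"\0" + child)
            else:
                st = entry.stat(follow_symlinks=False)
                # Assumption: size and whole-second mtime identify a version.
                meta = "%d:%d" % (st.st_size, int(st.st_mtime))
                h.update(b"F\0" + name + b"\0" + meta.encode("ascii") + b"\0")
        digest = h.digest()
        table[rel_path] = (digest, subdirs)
        return digest

    visit(root, "")
    return table


def dirs_to_compare(sender, receiver, rel_path=""):
    """Return the sender directories whose file lists still need exchanging.

    Branches with equal digests on both sides are skipped entirely.
    """
    s_digest, s_subdirs = sender[rel_path]
    r_digest = receiver.get(rel_path, (None, []))[0]
    if s_digest == r_digest:
        return []  # whole branch unchanged: no filenames are transferred
    result = [rel_path]
    for name in s_subdirs:
        child = os.path.join(rel_path, name)
        result.extend(dirs_to_compare(sender, receiver, child))
    return result
```

Here both tables are built locally for simplicity; in a real protocol each side would build its own table and only the digests would cross the wire, level by level. The sketch also shows the cost noted above: `scan` has to walk the whole tree before the first comparison can be made.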
--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN       | http://linuxhomepage.com/      http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/   http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------