Kyle Lanclos wrote:
> Peter Salameh wrote:
> > One of the speed-limiting issues with rsync is having to send huge file
> > lists when mirroring large file systems, even for incremental updates
> > where only a small part of the file system might have changed.
>
> Personally, I find that the sending of the file list, whether incremental
> or otherwise, takes orders of magnitude less time than the construction of
> the file list in the first place. The act of stat'ing millions of files
> takes an enormous amount of time in comparison to just about anything else,
> assuming that you are not on a low-bandwidth link.

I find both take time, and the dominant one depends on the link, and on
whether the stat information is already in RAM. Usually I only transfer
very large file lists over a LAN, though, so it's more like Kyle's
situation where the file stat'ing takes longest.

The only realistic way to eliminate stat time is some kind of filesystem
monitoring and attribute index - similar to the methods used by the
dynamic indexes of local search engine style programs. (On Linux that
means using inotify, and a daemon which runs all the time. On Windows
there are (perhaps) better methods which can survive reboots.)

Without that, you can reduce the stat time by scanning the filesystem in
a different way. I wrote a program many years ago called "treescan" which
did a recursive directory traversal while sorting stat calls by inode
number, taken from each directory entry's d_ino. On many filesystems, the
inode number is approximately related to position on the disk. On those
where it was, the heuristic sped up whole filesystem scans by a factor of
about 2, and on some directory structures by a factor of about 100.

It's possible some parallel stat calls would improve this further on some
OSes and kernel versions, by allowing better head seek optimisation at
the kernel level. But other OSes or kernel versions would be slowed by it.

> What would be ideal, I think, is for rsync to scan the filesystem while
> a transfer is in place;

I think rsync 3 does this; it's called incremental scan mode.

> with a configurable quantity of file transfer threads, combined with
> a configurable quantity of filesystem "spider" threads, would result
> in the most optimal interleaving of disk latency and time required
> to transfer files.

Be careful when using multiple unsynchronised threads to access a
filesystem. It sometimes thrashes the disk - seeking back and forth
between different files - resulting in much worse latency than just doing
one file at a time. That said, it can work out better.
Just have to be careful how it's done.

-- Jamie
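
P.S. The inode-sorting heuristic described above can be sketched roughly
as follows. This is not the original treescan source, which isn't
reproduced in this message; it is only a minimal Python illustration of
the same idea. It assumes a POSIX system, where os.scandir() exposes the
directory entry's d_ino through DirEntry.inode() without an extra stat
call. The function name scan_sorted_by_inode is made up for this example,
and sorting is done per directory only, which may or may not match what
treescan did.

```python
import os
import stat


def scan_sorted_by_inode(root):
    """Yield (path, stat_result) for every entry below root.

    Hypothetical helper: within each directory, entries are stat'ed in
    ascending inode order, using the inode number that comes with the
    directory entry (d_ino), in the hope of reducing disk seeks.
    """
    pending = [root]  # directories still to be read
    while pending:
        directory = pending.pop()
        try:
            with os.scandir(directory) as it:
                entries = list(it)
        except OSError:
            continue  # unreadable or vanished directory: skip it

        # On POSIX, inode() is taken from the directory entry itself,
        # so this sort does not trigger any stat calls.
        entries.sort(key=lambda entry: entry.inode())

        for entry in entries:
            try:
                st = os.lstat(entry.path)  # the expensive call, now in inode order
            except OSError:
                continue  # entry vanished between readdir and stat
            yield entry.path, st
            if stat.S_ISDIR(st.st_mode):
                pending.append(entry.path)
```

Something like "for path, st in scan_sorted_by_inode(top): ..." would then
walk a tree rooted at top. Whether this helps at all depends on the
filesystem: it only pays off where inode numbers roughly track on-disk
position, and only when the stat data is not already cached in RAM.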