On Tue, Jan 02, 2001 at 09:03:34PM -0500, August Zajonc wrote:
> Thanks for some interesting pointers. Unless I'm missing something it seems
> that rsync will still have to go through each filename and see if it
> matches --include. We're kinda pushing it here, but have around 500,000
> files. Files need to be synced at the longest once a minute, and my
> experience so far is that simply building a large filelist (rsync appears to
> do its entire search and *then* transfer) is relatively time-consuming, even
> if few of the files match.
Once a minute is too frequent for rsync; you should probably be using
an immediate push method instead, but anyway ...
> What has worked as a stop gap measure is to pass the filelist as a command
> line option such as
> rsync `cat filelist` dest::etc
>
> My concern was that there would be some limitation there, but we've managed
> all right with a small 800 file test. My feeling is that if rsync
> accepts a whole set of files on the command line it would be relatively
> straightforward to have it build a filelist from, well, a filelist. I'm
> gonna have one of the guys here take a look at patching it to some behavior
> like that.
If you can get by with using the command line that's probably the best way
to go. However, with the --include/--exclude '*' method, rsync doesn't
actually have to go through all 500,000 files if they're scattered through
directories that don't all have at least one of your desired files in
them. Rsync recurses through the directory structure, and when you use
--exclude '*' it will not enter into any of the directories that you
haven't explicitly included and thus will skip over many of the files.
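To make the include method concrete: since every parent directory has to
be included explicitly, the pattern list can be generated from a plain
list of paths. What follows is only a sketch and not anything that ships
with rsync. The names build_includes, includes.txt and /src/ are made up
for illustration, and it assumes the paths in filelist are relative to the
source directory and contain no wildcard characters.

```shell
# Read relative file paths on stdin and print anchored rsync include
# patterns: every parent directory (with a trailing slash), then the file.
# Each pattern is printed once, in first-seen order.
build_includes() {
    awk -F/ '
    {
        path = ""
        for (i = 1; i <= NF; i++) {
            # Skip empty and "." components (e.g. from "./a//b").
            if ($i == "" || $i == ".") continue
            path = path "/" $i
            # Directories get a trailing slash, the final component does not.
            pat = (i < NF) ? (path "/") : path
            if (!(pat in seen)) { seen[pat] = 1; print pat }
        }
    }'
}

# Hypothetical usage (source directory and file names are assumptions):
#   build_includes < filelist > includes.txt
#   rsync -a --include-from=includes.txt --exclude='*' /src/ dest::etc
```

For a filelist containing a/b/c.txt this prints /a/, /a/b/ and /a/b/c.txt,
so rsync only descends into the directories that lead to a wanted file.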
In versions 2.3.2 and earlier, rsync had an optimization that I put in such
that if the end of the list was --exclude '*' and the earlier includes
didn't have any wildcards, it would skip the recursive traversal of the
directories and just directly open all the included files. Andrew
Tridgell, the author of rsync, didn't like the fact that its semantics
weren't exactly the same as without the optimization though (it didn't
require the parent directories to be explicitly included) and he took out
the optimization in 2.4.0. I asked him if it could be put back and he
asked me to try to persuade him that it really made a significant
performance difference. At that I tried to come up with some pathologically
bad cases and finally had to admit that for my application I really
couldn't show any significant performance hit without the optimization. So
the optimization is gone in the 2.4.* series. You might want to try some
performance measurements in 2.3.2 vs. 2.4.6.
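A rough way to do that comparison is sketched below. It is only an
illustration: time_run is a made-up helper, the two install paths are
hypothetical, and it relies on "date +%s", so it only has one-second
resolution.

```shell
# Run a command, discard its output, and print the elapsed wall-clock
# time in whole seconds.
time_run() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical usage, with both versions installed side by side:
#   time_run /opt/rsync-2.3.2/bin/rsync -a --include-from=includes.txt --exclude='*' /src/ dest::etc
#   time_run /opt/rsync-2.4.6/bin/rsync -a --include-from=includes.txt --exclude='*' /src/ dest::etc
```

Running each version several times against an already-synced destination
should show mostly the cost of building the file list rather than the
transfer itself.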
- Dave Dykstra