Date: Tue, 27 Nov 2001 10:49:11 -0600
From: Dave Dykstra <[EMAIL PROTECTED]>
Thank you very much for doing the test, Alberto. I didn't have any set of files that large on which I could do a test, and as I said, when I tested the worst case I could think of with my application I couldn't measure an appreciable difference. First, I want to make sure that you really did get the optimization turned on.

[ . . . 3 paragraphs of clues on verifying optimization omitted . . . ]

I know you're trying to get reliable statistics so it's clear what sort of performance we're talking about here. But may I respectfully suggest that -having- to be so careful about whether optimization actually got turned on is a clue that there is still a big problem here? Seriously, even if --files-from= were -not- as efficient as the optimized case, if it's so difficult to ensure that you -are- in the optimized case, what's the point? If 90% of the users get it wrong---and 90% of -those- can't even figure out how to -tell-, even if they're trying to be careful---then clearly the optimization isn't as useful as it might be.

(And btw, if it's that hard to figure out, there should be a debugging switch that -tells- the user whether it got turned on. Yet another out-of-control command-line option, or perhaps an addition to one of the verbose modes, but not one that forces the user to drown in lots of other output, or causes unpatched rsyncs to hang, or... People shouldn't have to patch their local rsync just to be sure this is happening.)

Meanwhile, people are tying themselves in knots trying to figure out how to specify which files to transfer. As I pointed out months ago when this subject first came up, it seemed that about half the traffic on the list was from people who were confused about how to specify the list of files that rsync was supposed to handle. Letting them use other tools (e.g., find, or some perl script they just wrote) that are more transparent and with which they are more familiar would dramatically decrease their learning curve. (There's a sketch of what that might look like below.)

I would propose that, -whether or not- the use of --files-from= is a performance killer, rsync should have it. It -would- allow people to quickly debug a working setup. -If- for some reason its performance turned out to be bad compared to include/exclude, -then- they could go from a known-working configuration that might not run at full speed to a more-difficult-to-debug one that did. This is the right direction.

(If life were really that bad, it might not be hard for the statistics from a run to indicate how much time was spent traversing the file system vs. moving files over the connection, which would be a clue that it was time to move to the "optimized" case. But it'd be nice to just avoid having to think about this hair in the first place.)

And, of course, if the data we've seen -was- generated with optimization, then obviously there's no downside to --files-from=. It seems pretty clear that the data presented paints a bad picture. It's hard to believe that --files-from= could be worse.
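To make that concrete, here's a minimal sketch of the kind of invocation I have in mind. Since the option doesn't exist yet, the details are assumptions: I'm supposing --files-from= takes a file of newline-separated paths relative to the source directory, with "-" meaning "read the list from stdin". The destination and the file patterns are made up for illustration.

    # Build the list with a tool people already know:
    find . -name '*.html' -mtime -7 > /tmp/filelist

    # Hand the list straight to rsync:
    rsync -av --files-from=/tmp/filelist . remote:/var/www/

    # Or do it in one pipeline:
    find . -name '*.html' -mtime -7 | rsync -av --files-from=- . remote:/var/www/

The selection logic then lives somewhere the user can inspect directly (just look at the list file), instead of in include/exclude patterns whose interaction rules they have to learn first.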
P.S. Would --files-from= reduce rsync's large memory consumption as well, or does it still imply rsync caching some info about every file it sees during its entire run, and never flushing that info until the end? Not having to remember something about each file for the entire run would alone be a powerful reason to include it---there are some tasks for which finishing -at all- matters more than how long the run takes. It sucks to tell the user, "You can't use the slower approach at all, because we think you should always be fast. Go buy more memory instead---if the machine is under your control, can take more memory in the first place, etc." I don't recall whether -both- ends of the connection are so memory-intensive; if so, this is even more important.
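For a rough sense of scale on that memory question: my recollection is that rsync's own docs cite something like 100 bytes of file-list bookkeeping per file, so a back-of-the-envelope estimate is easy. The constant, and the path, are assumptions here:

    # Count the entries in a tree, then estimate the file-list memory
    # at ~100 bytes per entry (a figure I recall from the rsync docs;
    # treat it as an assumption, not gospel):
    n=`find /some/big/tree | wc -l`
    echo "$n entries -> roughly $((n * 100 / 1048576)) MB held for the whole run"

At a million files that's on the order of 100 MB which, as I understand the current code, is built up front and not released until the run ends, which is exactly the behavior the P.S. is asking about.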