Re: feedback on rsync-HEAD-20050125-1221GMT

Chris Shoemaker Fri, 28 Jan 2005 17:50:55 -0800

On Fri, Jan 28, 2005 at 03:42:25PM -0500, Alberto Accomazzi wrote:
> Chris Shoemaker wrote:
> 
> >If I understand Wayne's design, it would be possible to invent a
> >(per-directory) "hook" rule, whose value is executed, and whose stdout
> >is parsed as a [in|ex]clude file list.  E.g.:
> >
> > -R "cat .rsync-my-includes"
> >
> >or
> >
> > -R "find . -ctime 1 -a ! -fstype nfs -a ! -empty -o -iname 'foo*'"
> 
> This is certainly a very powerful mechanism, but it definitely should 
> not be the only way we implement file filtering.  Two problems:
> 
> 1. Sprinkling rule files like these across directories would mean 
> executing external programs all the time for each file to be considered.

No, only one execution per specified rule.  Most users of this feature
would specify one rule at the root directory.  But, if a user
wanted to change the rules for every directory, they would have to
specify a rule in each directory.  Then, yes, one execution per
directory.  Presumably they would do this because they actually need
to.  Never one execution per file.
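
To make the granularity concrete, here is a minimal sketch of the idea in
plain shell.  It is not rsync code and not Wayne's design: the file name
".rsync-hook" and the function run_hooks are hypothetical, chosen only for
illustration.  Each directory that carries a hook gets exactly one command
execution, however many files it holds.

```shell
# Sketch of the proposed per-directory hook, NOT an existing rsync feature.
# ".rsync-hook" is a hypothetical file holding one shell command whose
# stdout is taken as that directory's file list.

# run_hooks ROOT: run each directory's hook once, from inside that
# directory, and print the list it writes to stdout.
run_hooks() {
    find "$1" -type f -name .rsync-hook | while IFS= read -r hook; do
        dir=$(dirname "$hook")
        printf '# list for %s\n' "$dir"
        # One process per hook file; stdin is redirected so the hook
        # cannot swallow the remaining hook paths from the pipe.
        ( cd "$dir" && sh ./.rsync-hook ) </dev/null
    done
}
```

Directories without a hook file cost nothing here, and a single hook at
the root means a single execution for the whole tree.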

>  This would presumably slow down rsync's execution by an order of 
> magnitude or so and suck the life out of a system doing a big backup job.

If you're referring to process spawning overhead, it's no big deal.
If you're referring to the actual work required to return the file
list, what makes you think that rsync can do it more efficiently than
'cat' or 'find', or whatever tool the user chose?

> 
> 2. Who actually needs such a powerful yet hard-to-handle 
> mechanism?  Most of rsync's users are not programmers, and even us few 
> who are apparently still get confused with rsync's include/exclude 
> logic, forget about even more complicated approaches.

Do you mean include/exclude mechanism or filtering mechanism?  Well,
IMO, parsing a file list is *less* complicated than rsync's custom
pattern specification and include/exclude chaining.  Actually, I think
rsync patterns are /crazy/ complicated and fully deserve the pages
upon pages of documentation, explanation and examples that they get in
the man page.

But, complexity is somewhat subjective, so I won't argue (much) about
it.  In practice, /familiarity/ is far more important than complexity
in a case like this.  Someone who looks at rsync for the first time
has a _zero_ chance of having seen something like rsync's patterns
before, because there is nothing else like them.  (The allusion to GNU
tar's --exclude option which takes only a filename, not a pattern,
isn't really helpful in understanding rsync's --exclude option.)

OTOH, that same person has a (much) greater than zero chance of
already knowing how to use 'cat' or 'find' or whatever to specify a filelist.
That's a good reason to prefer the latter, IMO, even if it is *more*
complex (which is pretty hard to imagine).

Don't get me wrong, I recognize that rsync's pattern rules _resemble_
some other things, like regexp or bash-like expansions, but parts are
unique to rsync and there's a big difference between "I already know how
to use it" and "I have to spend 45 minutes figuring out what parts
resemble something I already know how to use."

> 
> >IMHO, rsync already has too much of its own "filtering" functionality,
> >and needs less, not more.  But maybe a hook like this that lets users
> >interface with their own filtering program is a step toward
> >deprecating rsync's [in|ex]clude[-from] options.
> >
> >Notice that generic include and exclude hooks immediately obsolete
> >the --*-from options and the --*=PATTERN options.  (rsync needs fewer
> >options, ya see? :)
> 
> I totally agree with you.  Having now read the description of the 
> --filter option in CVS's manpage (duh!)  I think what Wayne is working on 
> is right on the money and will satisfy 95% of rsync's power users (most 
> of rsync's regular users' needs are already met by the current 
> include/exclude rules).

Wayne's too nice.  He gets to actually _maintain_ all of this
complexity in rsync, and he does one helluva job.  If it were me, I
would mercilessly offload all pattern matching to some external
interface and deprecate all (or almost all) of rsync's pattern
matching support.  Since he maintains what he writes, by definition,
he really can't be going wrong.  That said, --filter *may* be my idea
of the "right path", but I won't be convinced until Wayne starts
*deleting* man page text, because rsync's pattern matching can be
fully explained in, say, one or two paragraphs.

It's not that pattern matching for file selection isn't complex --
it's just that it's such a well-defined, conceptually simple, common
task that other tools (like 'find' and 'bash') handle better than
rsync ever will.  And that's the way it should be: it's the unix way.

> 
> >>Wayne Davison wrote:
> >>
> >>
> >>>It already supports per-directory name rules, both inherited and not.
> >>>The idea of having per-directory size and time limits would not be hard
> >>>to add, and may be quite worthwhile.  For instance, assume 's' is for
> >>>size and 't' is for the modified time:
> >>>
> >>>  # Don't transfer files 1 GB or larger
> >>>  s< 1g
> >>>  # Don't transfer files 100 KB or smaller
> >>>  s> 100k
> >>>  # Only transfer new files (modified in the last day)
> >>>  t> yesterday
> >>>
> >>>Something like that, perhaps.
> >
> >
> >We don't really want to reinvent 'find', do we?
> 
> Well, no, that's why I was advocating adopting its syntax and reusing 
> its code so that rsync can do similar operations on a per-directory 
> basis, but as I said maybe this is already overkill.  

The best way to "reuse" code is to execute it externally.  It's not overkill.
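
As a rough illustration, the three rules in Wayne's example map onto
find(1) tests that already exist.  This is only a sketch: select_files is
a hypothetical helper name, and I am assuming 'k' and 'g' mean 1024 and
1024^3 bytes and that "yesterday" means "within the last 24 hours".

```shell
# Proposed rule (effect)              find(1) test
#   s< 1g    (skip 1 GB or larger)     -size -1073741824c
#   s> 100k  (skip 100 KB or smaller)  -size +102400c
#   t> yesterday (only recent files)   -mtime -1
#
# The 'c' suffix makes find compare exact byte counts, which avoids
# the round-up-to-whole-units behaviour of the larger size suffixes.
select_files() {
    find "$1" -type f -size +102400c -size -1073741824c -mtime -1
}
```

Calling select_files on a directory prints one path per line, ready to be
post-processed by any other tool.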

> I am against a 
> solution that would execute the find as an external program for each 
> file considered for performance reasons, though.  

See above.  User chooses granularity, up to per-directory.  In
reality, external filelist generation is probably *more* efficient.

> If you really need 
> complete freedom maybe the way to go is to do your file selection first 
> and use --files-from.  

Yes, --files-from is nice, and honestly, almost completely sufficient.
But in some dynamic cases, you can't keep the list updated.
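
For the static case, a minimal sketch of that approach follows.  The
--files-from and --from0 options are real rsync options; the helper names
list_files and sync_selected are hypothetical, as are the paths in the
usage comment.

```shell
# list_files SRC [find-tests...]: print the regular files under SRC
# that match the given find tests, NUL-terminated and relative to SRC.
list_files() {
    src=$1
    shift
    ( cd "$src" && find . -type f "$@" -print0 )
}

# sync_selected SRC DEST [find-tests...]: hand that list to rsync.
# --files-from=- reads the list from stdin, --from0 says the entries
# are NUL-terminated, and the names are taken relative to SRC.
sync_selected() {
    src=$1
    dest=$2
    shift 2
    list_files "$src" "$@" | rsync -a --from0 --files-from=- "$src" "$dest"
}

# Usage (hypothetical paths):
#   sync_selected /data backuphost:/data -name '*.log' -mtime -1
```

The list is a snapshot taken when find runs, which is exactly the
limitation in the dynamic cases: anything that appears afterwards is not
in it.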

> The reason to implement a good --filter option is 
> because it sits in a sweet spot between the --include/exclude and the 
> --files-from scenarios.  It still lets rsync do all the work of figuring 
> out the file list with just a little effort from the user.  The real 

"just a little effort"?  The effort that matters is not the effort of
typing the characters "--filter foo" - it's the effort of learning
what to type, which will _always_ be substantial unless I've already
learned it before.

> challenge is making this powerful without making it too complicated, 
> because in that case nobody will use it.

You see --filter as less complicated than --include/exclude, then?
It's certainly more powerful.

 -chris

