On Sat, Nov 21, 2009 at 09:08:05AM -0500, Matt McCutchen wrote:
> On Fri, 2009-11-13 at 18:58 +0100, H. Langos wrote:
> > > > > [--detect-renamed] doesn't calculate name similarity like --fuzzy
> > > > > because that would
> > > > > be prohibitively expensive in the current implementation.
> > > > Only files of the same size should be
> > > > candidates to start with, right?
> > >
> > > No, the name similarity calculation I'm talking about is the fallback to
> > > select a similar basis file when no available destination file passes
> > > the quick check, so it does not require a size match.
> >
> > Hmm, ok so fuzzy also finds files that are slightly different and have their
> > name slightly changed.
>
> There's no "slightly" on "different" there.  Assuming --fuzzy doesn't
> find a quick-check match (and it probably won't because --detect-renamed
> has already searched the whole destination with the same criteria), the
> choice of basis files is based exclusively on name similarity.

Ok, I see. Does "--fuzzy" check if the file size is in the same order of
magnitude (or at most one order up/down)? Expensive fuzzy string matching
on filenames can probably safely be skipped if

  abs(round(log10(ssize))-round(log10(dsize))) > 1

> > This sounds like it would be a good idea to (have the option to) include
> > the delete candidates directory .~tmp~ (or whatever else "--detect-renamed"
> > uses) included in the --fuzzy search.
>
> I'm not clear on what you're proposing here.  Could you provide an
> example?

Ok, I start with this situation, where src and dst are in sync:

  src/new/foo.jpg
  src/new/bar.jpg
  src/2009/

  dst/new/foo.jpg
  dst/new/bar.jpg
  dst/2009/

Then I run my picture import script. The files are renamed, moved to
different directories, and some bytes have been added. I end up with
something like this:

  src/new/
  src/2009/2009-11-23-foo.jpg
  src/2009/2009-11-23-bar.jpg

  dst/new/foo.jpg
  dst/new/bar.jpg
  dst/2009/

If I run rsync in this situation, then dst/new/foo.jpg and dst/new/bar.jpg
will end up on the destination's to-be-deleted list, and --fuzzy would find
nothing in dst/2009/ that it could use as a base for the "new" files
src/2009/2009-11-23-foo.jpg and src/2009/2009-11-23-bar.jpg.

What I propose is that, let's call it "--fuzzy-detect-renamed", should not
only look in the same directory but also in the to-be-deleted list that
"--detect-renamed" uses as temporary asylum for deletion/renaming
candidates.

Since foo.jpg and bar.jpg are on that to-be-deleted list, my expectation
of that new behavior is that foo.jpg would be taken as the base for
2009-11-23-foo.jpg and bar.jpg would be taken as the base for
2009-11-23-bar.jpg.

> > In fact I do just those things with a script when
> > importing pictures from any of my cameras into the photo archive. I
> > rename them as shown above and then I move them to a directory structure
> > made of <year>/<month>/<day>/ . I don't change the exif tags yet, which
> > I wanted to add in the future.
> > But that would make the size+mtime/checksum test fail. Using "--fuzzy"
> > would help, but only if I'd do an rsync between the moving operation
> > and the tag changing operation.
> >
> > No matter which operation I'd do first, but doing both together would
> > mean completely new transfer to my backup location. :-/
>
> Right.  Note that if you did an rsync between the moving and the tag
> changing, you wouldn't need --fuzzy on the second rsync because the
> files would already be in the right places.

Right.

> Efficiently handling simultaneous renames and data changes is very hard
> for a stateless tool like rsync.  If I understand correctly that you're
> moving files without changing their basenames, it would work in this
> case to extend --detect-renamed to look for an exact basename match if
> there is no quick-check match.

I do change the basename too (e.g. I rename "img_1023.jpg" to
"2009-10-18_img_1023.jpg"), but in a way that fuzzy matching should be
able to catch.

> That would overlap even more with the current --fuzzy functionality.
> There may be a better way to factor things.

Right. There are a lot of options that change the way rsync looks for
quick-check or basefile candidates, and due to the organic growth of
features their behavior is not always what users expect. Maybe it is
time to think about a more consistent way to control the search for a
basefile and quick-check candidate.

My first idea would be to add a more explicit form of control, e.g. lists
of key-value pairs that say _what_ aspect of a file you want to match and
_how well_ it needs to match, either to pass the quick-check or to be used
as a base for the delta transfer.

Existing options can easily be translated into that explicit form so that
internally there would only be one control logic.

Here are some examples of the current options translated into that new
schema (I hope I got them right, but keep in mind that this is just a
sketch):

The default behaviour of rsync is something like this:

  --quick path=same,filename=same,size=same,mtime=same
  --delta path=same,filename=same

When given the "--checksum" option it is:

  --quick path=same,filename=same,checksum=same
  --delta path=same,filename=same

With the current "--fuzzy" option it is:

  --quick path=same,filename=same,size=same,mtime=same
  --delta path=same,filename=same
  --delta path=same,filename=fuzzy

With the current "--detect-renamed" option it is:

  --quick path=same,filename=same,size=same,mtime=same
  --delta path=same,filename=same
  --delta path=deleted,size=same,mtime=same

This is more easily extensible, as new aspects can be added without
changing current behaviour, and new "qualities" of matching can be added
to express stuff like:

explicit source files (regardless of the src filename):

  --delta path=some/arbitrary/path/,filename=foo.img

a "pool directory" of source files:

  --quick path=my/pool/path/,filename=same,mtime=same,size=same
  --delta path=my/pool/path/,filename=fuzzy,size=fuzzy

only use files as base if they are smaller:

  --delta path=same,filename=same,size=smaller

You could even express when to skip the delta comparisons completely,
e.g. if the destination file was created before the source file (a
situation that you encounter when syncing a directory with rotating log
files and a rotation has taken place at /src):

  --whole path=same,filename=same,ctime=older

Sure, this schema is more verbose than the current set of options, but
people would use it in scripts rather than on the command line, and there
you want your commands to be as verbose and explicit as possible. After
all, you'll want somebody else to understand your scripts without reading
every command's man page, and you'll want the behavior to stay constant
even when the next major version of rsync changes the behavior of one of
the summary options.

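To make the key=value idea a bit more concrete, here is a rough Python
sketch of how such rule strings could be parsed and checked against a pair
of files. None of this exists in rsync: the names parse_rule, matches,
same_magnitude and the FileInfo record are made up for illustration, and
reading "size=fuzzy" as "within one order of magnitude" (the log10 test
from earlier in this mail) is just my assumption. Only the "same" quality
and that one "fuzzy" size test are handled; everything else is left out.

```python
import math
from collections import namedtuple

# Hypothetical file metadata record; not an rsync data structure.
FileInfo = namedtuple("FileInfo", "path name size mtime")


def parse_rule(spec):
    """Turn 'path=same,filename=fuzzy' into {'path': 'same', 'filename': 'fuzzy'}."""
    rule = {}
    for pair in spec.split(","):
        aspect, sep, quality = pair.partition("=")
        if not sep or not aspect.strip() or not quality.strip():
            raise ValueError("expected aspect=quality, got %r" % pair)
        rule[aspect.strip()] = quality.strip()
    return rule


def same_magnitude(ssize, dsize):
    """Assumed meaning of size=fuzzy: at most one order of magnitude apart."""
    if ssize <= 0 or dsize <= 0:
        # log10 is undefined here; only treat identical sizes as a match.
        return ssize == dsize
    return abs(round(math.log10(ssize)) - round(math.log10(dsize))) <= 1


def matches(rule, src, dst):
    """Return True if dst satisfies every aspect=quality pair relative to src."""
    for aspect, quality in rule.items():
        field = "name" if aspect == "filename" else aspect
        if field not in FileInfo._fields:
            raise NotImplementedError("aspect %r not in this sketch" % aspect)
        s, d = getattr(src, field), getattr(dst, field)
        if quality == "same":
            if s != d:
                return False
        elif aspect == "size" and quality == "fuzzy":
            if not same_magnitude(s, d):
                return False
        else:
            raise NotImplementedError("%s=%s not in this sketch" % (aspect, quality))
    return True


if __name__ == "__main__":
    quick = parse_rule("path=same,filename=same,size=same,mtime=same")
    src = FileInfo("new", "foo.jpg", 2048, 1258800000)
    dst = FileInfo("new", "foo.jpg", 2048, 1258800000)
    print(matches(quick, src, dst))  # prints True
```

The point of going through one parser and one matcher is the "only one
control logic" part: the summary options would just expand to lists of
such rules.
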
cheers
-henrik