On Wed, Oct 26, 2005 at 02:04:34PM -0400, Chris Shoemaker wrote:
> That option should imply at least, --checksum and --delete-after if
> --delete at all.
I don't think it needs --checksum because rsync can simply use a non-exact match as the basis file for the transfer.

> For each file on the sender which is *missing* from the receiver, it
> needs to search the checksums of all of receiver's existing files for
> a checksum match.

I'd make it:

(1) Look up a file-size + mod-time + file-name match; if found, copy that file locally and consider the update done.

(2) Look up a file-size + mod-time match OR just a file-name match, and use that file as a basis file in the transfer, which can greatly speed up the transfer if the old file is largely the same as the new file.

The way I see this being implemented is to add a hash-table algorithm to the code so that rsync can hash several things as the names arrive during the opening file-list reception stage: the receiving side would take every arriving directory name (starting with the dest dir) and look up the names in the local version of that dir, creating a hash table keyed on file-size + mod-time, a hash table keyed on file-name (for regular files), and a hash table of any directory names it finds (this attempts to do the receiving-side scanning incrementally as the names arrive instead of in a separate pass after the file-list is finished). As each directory gets scanned, its name gets removed from the directory-name hash. At the end of the file-list reception, any directory names remaining in the dir-hash table also get scanned (recursively). This would give the generator the info it needs to look up missing files and check for exact or close matches.

One vital decision is picking a good hash-table algorithm that allows the table to grow efficiently (since we don't know beforehand how many files we'll need to hash). I'm thinking that trying the libiberty hashtab.c version might be a good starting point. Suggestions?
Perhaps a better idea than a general-purpose hash-table algorithm would be to just collect all the data in an array (expanding the array as needed) and then sort it when we're all done, using a binary search to find a match. The reason this might be better is that the number of missing files is not likely to be a huge percentage of the transfer, so making the creation of the "hash table" efficient might be more important than making the lookup of missing files maximally efficient.

Have you done any work on this, Chris? If not, I'm thinking of looking into this soon.

..wayne..

--
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html