Shachar,

It does use a hash table.  rsync adds the two components of the rolling checksum together to come up with a 16-bit hash and performs a table lookup.  The table entry contains an offset into the sorted rsync checksum table (which holds both the weak and the strong checksums), and rsync then does a linear search from there.
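In case it helps make that concrete, here is a rough sketch of the lookup in C.  The names (sum_entry, hash16, find_match and so on) are illustrative, not rsync's actual identifiers, and the details are simplified:

#include <stdint.h>
#include <stddef.h>

struct sum_entry {
    uint32_t weak;              /* 32-bit rolling checksum */
    unsigned char strong[16];   /* strong checksum (e.g. MD4) */
    size_t offset;              /* block offset in the basis file */
};

/* Add the two 16-bit halves of the rolling checksum to get a 16-bit hash. */
static inline unsigned hash16(uint32_t weak)
{
    return ((weak & 0xFFFF) + (weak >> 16)) & 0xFFFF;
}

/* hash_table[h] holds the index of the first entry in the sorted checksum
 * list whose weak checksum hashes to h, or -1 if no block hashes there. */
static int find_match(uint32_t weak, const struct sum_entry *sums,
                      size_t nsums, const int32_t *hash_table)
{
    int32_t i = hash_table[hash16(weak)];
    if (i < 0)
        return -1;                      /* no candidate blocks at all */

    /* Linear scan from the known starting point in the sorted list. */
    for (; (size_t)i < nsums && hash16(sums[i].weak) == hash16(weak); i++) {
        if (sums[i].weak == weak)
            return i;                   /* weak hit; caller verifies strong sum */
    }
    return -1;
}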

As outlined in Andrew's original thesis, if you are working with huge files, the initial hash table lookup may always hit, which means a linear search is necessary - but that search starts from a known location in the full list, so it isn't that big of a deal.

I would *strongly* recommend that you dig into the thesis a bit (just the section that describes the rsync algorithm itself).  It took me a while to fully appreciate the nuances of the algorithm, but once you get it, it is incredibly elegant - and very, very efficient for generalized binary difference computation.  You may have a specific need that could be handled more efficiently than what rsync provides (rsync is a generalized algorithm - if you have knowledge about the files you want to sync, then you may very well be able to do better than rsync with a custom-built technique).


Now, if you have huge files, then the 16-bit hash may not be sufficient to keep the false-hit rate down.  At that point, you may want to consider a few algorithmic alternatives:

1.  Break up your file into logical blocks and rsync each block individually.  If the file is an "append-only" file, then this is fine.  However, if the contents of the file get re-ordered across block boundaries, then the efficiency of the rsync algorithm will be seriously degraded.

2.  Use a larger hash table.  Instead of 16 bits, expand it to 20 bits - it will require 16 times as much memory for the hash table, but that may not be an issue for you - you are probably working with some relatively beefy hardware anyway, so what the heck.  I'll leave the exercise of converting the full rsync 32-bit rolling checksum into a 20-bit value to you (a rough sketch follows below).
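As a starting point for that exercise, one simple (untested) way to fold the 32-bit rolling checksum into a 20-bit table index would be a multiplicative hash; the constant and names here are just one possible choice, not anything rsync itself uses:

#include <stdint.h>

#define HASH_BITS 20
#define HASH_SIZE (1u << HASH_BITS)     /* 1,048,576 slots */

/* Fold all 32 bits of the rolling checksum into HASH_BITS bits.
 * The Knuth-style multiplicative constant mixes the high bits into
 * the result; any reasonable mixing function would do. */
static inline uint32_t hash20(uint32_t weak)
{
    return (uint32_t)(weak * 2654435761u) >> (32 - HASH_BITS);
}

With 32-bit index entries, that table is 4 MB instead of 256 KB - the 16-fold memory increase mentioned above.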

Cheers,

Kevin




Original Message
--------------------
> Wayne Davison wrote:

>However, you should be sure to have measured what is causing the
>slowdown first to know how much that will help.  If it is not memory
>that is swapping on the sender, it may be that the computing of the
>checksums is maxing out your CPU, and removing the caching of the
>remote checksums won't buy you as much as you think.  You could use some
>of the librsync tools (e.g. rdiff) to calculate how long various actions
>take on each system (i.e. try running rdiff on each system outputting to
>/dev/null to see how long the computing of the checksums takes).
>
>..wayne..
>  
>
Hi Wayne,

Excuse me if I'm talking utter nonsense here. I have only just now
opened the code up and looked at it. It does seem, however, that there
is a considerable optimization that can be performed here.

Correct me if I'm wrong, but it seems to me that the checksum matching
code is in match.c, inside hash_search - particularly the "do...while"
loop. It seems that the loop is there to scan the entire checksums list
for each byte. Is that really the case? If so, we can probably make it
much much (much much much) more efficient by using a hash table instead.
We wouldn't even have to change the line protocol in any way.

Am I misreading the code?

         Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting ltd.
Have you backed up today's work? http://www.lingnu.com/backup.html

