Shachar-
 
I think Wayne is mostly pointing you to the correct location here.  If you look at the code where the checksum is computer, you'll find that the rolling checksum actually consists of two parts - one is the sum of all of the byte values in the window, the other is the offset weighted sum of all of the byte values in the window.  The first of these is then left shifted 16 bits and added to the other to come up with the official "32 bit" rolling checksum.  This works fine as long as you aren't counting on a random distribution of bits among the 32 - if you mod the value, you are giving much greater importance to the lower XX bits, effectively dropping the distribution of the high order bits...
 
Anyway, the two 16 bit values may be random enough that my concern is not founded, but it should be tested before assuming that the rolling checksum is really a 32 bit value that can easily be divided up into buckets.
 
PS - none of the above has anything to do with the strong signature of the window - just the rolling check sum.
 
Cheers!
 
- K
 
 
Original Message
--------------------
> On Sat, Mar 05, 2005 at 02:18:01PM +0200, Shachar Shemesh wrote:
> That's not what I read into it. It seems to me that the checksum
> function gives a 32bit result, and we are squashing that into a 16bit
> hash table. Can you point me to the code? Wayne?

Look in match.c.  The hash table is "tag_table", and it's built in the
function "build_hash_table()".  It uses the macro gettag() to transform
the weak checksum into a 16-bit index into the hash table by adding the
two halves of the 32-bit checksum together.

> >However, if you choose to start with the 32 bit rolling hash and mod
> >that, you will have problems.  The rolling checksum has two distinct
> >parts, and modding will only pull info from the low order bits,
>
> Why? This may be something I missed within the code.

He's talking about potentially losing an even distribution of values if
the lowest order bits aren't random enough.  I think we're all only
guessing that it might cluster too much if you just take the value%N
(at least I haven't tried to look at this aspect of the checksum).

> If the source is 16 bit, doing any hash table size bigger than 65536
> buckets would make no sense, true. Is it 16bit?

The source value is 32 bits.  The "two parts" you ask about refers to
the gettag() macro's current algorithm of how it transforms the 32-bit
checksum into a 16-bit hash index.

..wayne..
--
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

<
--------------------
-- 
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

Reply via email to