Shachar-
I think Wayne is mostly
pointing you to the correct location here.
If you look at the code where the checksum
is computer, you'll find that the rolling
checksum actually consists of two parts -
one is the sum of all of the byte values
in the window, the other is the offset weighted
sum of all of the byte values in the window.
The first of these is then left shifted 16
bits and added to the other to come up with
the official "32 bit" rolling checksum.
This works fine as long as you aren't counting
on a random distribution of bits among the
32 - if you mod the value, you are giving
much greater importance to the lower XX bits,
effectively dropping the distribution of
the high order bits...
Anyway, the two 16 bit
values may be random enough that my concern
is not founded, but it should be tested before
assuming that the rolling checksum is really
a 32 bit value that can easily be divided
up into buckets.
PS - none of the above
has anything to do with the strong signature
of the window - just the rolling check sum.
Cheers!
- K
Original Message
--------------------
>
On Sat, Mar 05, 2005 at 02:18:01PM +0200,
Shachar Shemesh wrote: > That's not
what I read into it. It seems to me that
the checksum > function gives a 32bit
result, and we are squashing that into a
16bit > hash table. Can you point
me to the code? Wayne?
Look in match.c.
The hash table is "tag_table", and
it's built in the function "build_hash_table()".
It uses the macro gettag() to transform the
weak checksum into a 16-bit index into the
hash table by adding the two halves of
the 32-bit checksum together.
>
>However, if you choose to start with
the 32 bit rolling hash and mod >
>that, you will have problems. The
rolling checksum has two distinct >
>parts, and modding will only pull info
from the low order bits, > >
Why? This may be something I missed within
the code.
He's talking about potentially
losing an even distribution of values if the
lowest order bits aren't random enough. I
think we're all only guessing that it
might cluster too much if you just take the
value%N (at least I haven't tried to look
at this aspect of the checksum).
>
If the source is 16 bit, doing any hash table
size bigger than 65536 > buckets would
make no sense, true. Is it 16bit?
The
source value is 32 bits. The "two parts"
you ask about refers to the gettag() macro's
current algorithm of how it transforms the
32-bit checksum into a 16-bit hash index.
..wayne.. --
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync Before
posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
<
--------------------
|
--
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html