Re: Dynamic hash table size (with static has load)

Shachar Shemesh Sat, 25 Feb 2006 10:43:03 -0800

Wayne Davison wrote:

>Thanks for the patch!  Here's some comments:
>
> - You didn't change the size of the "tag" typedef (an unsigned short),
>   and your patch makes the value potentially overflow.
>  
>
Gotcha. I'm sending an amended patch.


> - For smaller hash-table sizes, your algorithm does a lookup in the
>   table based only on the s1 value (due to the (s2 << 16) value being
>   too large to have any remainder less than the tablesize).
>
>So, I think this probably needs to leave gettag() calling gettag2(), and
>change gettag2() to factor both s1 and s2 into some kind of an improved
>tag-generating computation.
>  
>
I disagree.

Let's begin with an example. Suppose that we only have 7 hashes
(s->count=7). 7/8=0. 0*10=0. 0+11=11. Our hash table size is 11, which
is the absolute minimum it will ever get.

Now let's suppose all 7 hashes have "1234" as their lower hash value,
and have the number 1000 through 1006 as their high value. They will be
filed to:
1000:1234 -> 4
1001:1234 -> 2
1002:1234 -> 0
1003:1234 -> 9
1004:1234 -> 7
1005:1234 -> 5
1006:1234 -> 3

Obviously, the higher checksum DID get a chance to affect the cell we
land in.

For the more general case, our function (s1+s2*65536)%ts (where ts is
the table size). Modern algebra dictates that this is the same as saying
(s1%ts + (s2%ts) * (65536%ts))%ts. In other words, you can first mod
each element individually, and only then do the actual addition and
subtraction. It's easy to see that s2 will not get nullified ever,
unless 65536%ts is zero. As 65536 is 2^16, and as ts is guarenteed to be
odd, this is impossible.

Venturing deeper into modern algebra, we know it is theoretically
possible that s2 will have some affect on the hash cell chosen, but will
not be able to choose any cell at all. This can be seen in the case of
(s1+s2*15)%9. If s1 is, say, 3, the different s2 values can select cells
3, 6 and 0. This will happen if and only if the factor (15) and the
modulo (9) have a greatest common divisor (gcd - open office calc
actually has a function of that name) which is larger than 1 (3, in this
case). In jargon, we will say that two number that have a gcd of 1 are
coprime.

Since ts is always odd (we multiply a number by 10 and then add 11), it
will always be coprime to 65536 (which is only divided by even numbers).
This means that s2 has as much a chance to select the hash cell we end
up in as s1. I don't think it is necessary to change that aspect of the
code.

I did change the comment in the patch to summarize this point.

>..wayne..
>  
>
       Shachar

? dynamic_hash.patch
Index: match.c
===================================================================
RCS file: /cvsroot/rsync/match.c,v
retrieving revision 1.78
diff -u -r1.78 match.c
--- match.c	24 Feb 2006 16:43:44 -0000	1.78
+++ match.c	25 Feb 2006 18:42:05 -0000
@@ -26,9 +26,8 @@
 
 int updating_basis_file;
 
-typedef unsigned short tag;
+typedef unsigned int32 tag;
 
-#define TABLESIZE (1<<16)
 #define NULL_TAG (-1)
 
 static int false_alarms;
@@ -49,10 +48,11 @@
 
 static struct target *targets;
 
+static size_t tablesize;
 static int32 *tag_table;
 
-#define gettag2(s1,s2) (((s1) + (s2)) & 0xFFFF)
-#define gettag(sum) gettag2((sum)&0xFFFF,(sum)>>16)
+#define gettag2(s1,s2) gettag((s1) + ((s2)<<16))
+#define gettag(sum) ((sum)%tablesize)
 
 static int compare_targets(struct target *t1,struct target *t2)
 {
@@ -64,8 +64,14 @@
 {
 	int32 i;
 
-	if (!tag_table)
-		tag_table = new_array(int32, TABLESIZE);
+	/* Dynamically calculate the hash table size so that the hash load
+	 * is always about 80%.
+	 * This number must be odd or s2 will not be able to span the entire set
+	 */
+	tablesize=(s->count/8)*10+11;
+	
+	free(tag_table);
+	tag_table = new_array(int32, tablesize);
 
 	targets = new_array(struct target, s->count);
 	if (!tag_table || !targets)
@@ -78,7 +84,7 @@
 
 	qsort(targets,s->count,sizeof(targets[0]),(int (*)())compare_targets);
 
-	for (i = 0; i < TABLESIZE; i++)
+	for (i = 0; i < tablesize; i++)
 		tag_table[i] = NULL_TAG;
 
 	for (i = s->count; i-- > 0; )

-- 
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

Re: Dynamic hash table size (with static has load)

Reply via email to