On Wed 06 May 2015 06:42:30 PM CEST, Stefan Hajnoczi wrote:

>> By using a hash of the offset as the starting point, lookups are
>> faster and independent of the array size.
>> 
>> The hash is computed using the cluster number of the table, multiplied
>> by 4 to make it perform better when there are collisions.
> ...
>> +    i = lookup_index = (offset / c->table_size * 4) % c->size;

> The hash function is very weak.  It's better than always starting with
> i=0 but mixes input bits very poorly.  Have you considered an integer
> hash function so that more input bits affect the output?
> http://burtleburtle.net/bob/hash/integer.html

I didn't try very hard to optimize the hash function because it doesn't
have a big impact; I just wanted something simple and a bit better than
i=0.
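
For reference, plugging in one of the integer hash functions from that
page would look something like this (an untested sketch on my part; the
mix function is Bob Jenkins's "full avalanche" hash from that page, and
the surrounding names are the ones from the patch):

    static uint32_t hash32(uint32_t a)
    {
        a = (a + 0x7ed55d16) + (a << 12);
        a = (a ^ 0xc761c23c) ^ (a >> 19);
        a = (a + 0x165667b1) + (a << 5);
        a = (a + 0xd3a2646c) ^ (a << 9);
        a = (a + 0xfd7046c5) + (a << 3);
        a = (a ^ 0xb55a4f09) ^ (a >> 16);
        return a;
    }

    /* Hash the table number; offset is 64-bit, so this truncates to
     * the low 32 bits of the table number, which is fine here */
    i = lookup_index = hash32(offset / c->table_size) % c->size;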

I did my tests with a disk image with preallocated metadata and a cache
size big enough for the whole disk. In that scenario I measured an
average of 2.5 comparisons per lookup with this function.
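
In case it is useful, here is roughly how I counted the comparisons.
The lookup is plain linear probing starting from the hashed index; this
is a simplified sketch rather than the literal code, and the
entries[].offset layout is an assumption on my part:

    i = lookup_index = (offset / c->table_size * 4) % c->size;
    do {
        /* One comparison per iteration; 2.5 on average in my tests */
        if (c->entries[i].offset == offset) {
            return i;              /* cache hit */
        }
        i = (i + 1) % c->size;     /* collision: try the next slot */
    } while (i != lookup_index);
    return -1;                     /* miss: evict an entry and load the table */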

> Regarding collisions, the multiply-by-4 effectively reduces the hash
> space by a factor of 4.  This *increases* the chance of collisions in
> the hope that the extra slots will prevent chains from forming.

It does, but it keeps the average chain length much lower, at least in
my tests: the average number of comparisons per lookup went down from
more than 100 to 2.5.
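
To illustrate with made-up numbers (a 16-entry cache and 64 KB tables,
so only slots 0, 4, 8 and 12 are possible starting points):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long table_size = 65536;  /* hypothetical */
        unsigned size = 16;                     /* hypothetical */

        /* Consecutive tables map to slots 0, 4, 8, 12, 0, ... so a
         * colliding entry has three free slots after its ideal
         * position before it reaches the next ideal position */
        for (unsigned long long n = 0; n < 8; n++) {
            unsigned long long offset = n * table_size;
            printf("table %llu -> slot %llu\n", n,
                   (offset / table_size * 4) % size);
        }
        return 0;
    }

So the price of having only size/4 distinct starting points is that a
colliding entry usually finds a free slot within the next three, which
is what keeps the chains short.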

> Feel free to leave the hash function as-is, I don't think it's a huge
> issue either way.

I think I'll just leave it as-is for the time being, but thanks a lot
for your comments.

Berto
