On Wed 06 May 2015 06:42:30 PM CEST, Stefan Hajnoczi wrote:
>> By using a hash of the offset as the starting point, lookups are
>> faster and independent from the array size.
>>
>> The hash is computed using the cluster number of the table,
>> multiplied by 4 to make it perform better when there are collisions.
> ...
>> + i = lookup_index = (offset / c->table_size * 4) % c->size;
>
> The hash function is very weak. It's better than always starting with
> i=0 but mixes input bits very poorly. Have you considered an integer
> hash function so that more input bits affect the output?
>
> http://burtleburtle.net/bob/hash/integer.html

I didn't try very hard to optimize the hash function because it
doesn't have a big impact; I just wanted something simple and a bit
better than always starting at i=0.

I did my tests with a disk image with preallocated metadata and a
cache size big enough for the whole disk. In that scenario I measured
an average of 2.5 comparisons per lookup with this function.
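To make the scheme concrete, this is roughly what the lookup looks
like with the new starting point. It's a simplified sketch, not the
actual qcow2-cache code: only the expression computing lookup_index
is taken from the patch, and the struct and probing loop here are
illustrative stand-ins.

/* Simplified sketch of a lookup with a hashed starting point.
 * Hypothetical types; not QEMU's real Qcow2Cache. */

#include <stdint.h>

typedef struct {
    uint64_t offset;     /* file offset of the cached table */
} CacheEntry;

typedef struct {
    CacheEntry *entries;
    unsigned size;       /* number of entries in the cache */
    unsigned table_size; /* size of each cached table, in bytes */
} Cache;

/* Returns the index of the entry caching 'offset', or -1 if it is
 * not cached. */
static int cache_lookup(Cache *c, uint64_t offset)
{
    /* Start at the table's cluster number times 4; the gaps left by
     * the multiplication give colliding entries room to spill into
     * neighbouring slots instead of forming one long chain. */
    unsigned lookup_index = (offset / c->table_size * 4) % c->size;
    unsigned i = lookup_index;

    do {
        if (c->entries[i].offset == offset) {
            return i;
        }
        if (++i == c->size) {
            i = 0;       /* wrap around to the start of the array */
        }
    } while (i != lookup_index);

    return -1;
}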
> Regarding collisions, the multiply-by-4 effectively reduces the hash
> space by a factor of 4. This *increases* the chance of collisions in
> the hope that the extra slots will prevent chains from forming.

It does, but it keeps the average chain length much lower, at least
in my tests: it went down from more than 100 comparisons per lookup
to 2.5.

> Feel free to leave the hash function as-is, I don't think it's a
> huge issue either way.

I think I'll just leave it as-is for the time being, but thanks a lot
for your comments.

Berto
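PS: for reference, this is the kind of integer hash I understand you
mean. A sketch only, which I haven't benchmarked: it uses Knuth's
golden-ratio multiplier rather than any particular function from the
page above, and the helper name and parameters are made up for
illustration.

#include <stdint.h>

/* Hypothetical helper, not from the patch.  Multiplicative
 * (Fibonacci) hash: 2654435761 is 2^32 divided by the golden ratio,
 * rounded; multiplying by it lets the high bits of the cluster
 * number affect the result, unlike the plain multiply-by-4. */
static inline unsigned hash_start_index(uint64_t offset,
                                        unsigned table_size,
                                        unsigned cache_size)
{
    uint32_t cluster = offset / table_size;
    return (uint32_t)(cluster * 2654435761u) % cache_size;
}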