Here are two useful references:

  http://bretm.home.comcast.net/hash/
  http://burtleburtle.net/bob/hash/

RE computation cost: Generalized integer multiplication and integer
division both used to be expensive.  On modern processors a
multiplication is fully pipelined and its latency is comparable to an
L1 data cache hit (~2 to 4 cycles).  Most modern divider
implementations clock the divide circuit at a multiple of the CPU
clock (typically 2x) and implement early out.  Further, division is
nearly always an autonomous, non-pipelined logic block that, once
initiated, runs asynchronously until it is ready to deliver its
result.  Bottom line: on any contemporary out-of-order x86 chip from
either Intel or AMD the cost of either of these instructions is
negligible.

RE general hashing concepts: I find it helpful to think of a hash
table lookup as proceeding in two steps.  The first step is to
generate a fixed-size digest of the (possibly variable-length) key.
On a 32-bit machine this digest is most commonly a 32-bit unsigned
integer.  The quality of a digest is measured by how much of the
entropy in the key it preserves.
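
As a concrete illustration of that first step, here is FNV-1a, one
widely known byte-at-a-time digest function.  The post does not name a
particular digest; this is just a minimal example of reducing a
variable-length key to a 32-bit digest:

  #include <stddef.h>
  #include <stdint.h>

  /* Step 1: reduce a variable-length key to a fixed-size 32-bit
   * digest.  FNV-1a is shown purely as an illustration, not as the
   * specific function the post has in mind. */
  static uint32_t digest32(const void *key, size_t len)
  {
      const unsigned char *p = key;
      uint32_t h = 2166136261u;        /* FNV-1a offset basis */
      for (size_t i = 0; i < len; i++) {
          h ^= p[i];                   /* fold in each key byte */
          h *= 16777619u;              /* FNV-1a prime; the multiply
                                          spreads the byte's bits across
                                          the whole digest */
      }
      return h;
  }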
The second step is to map the digest into a bucket selector.  Nearly
universally this is a modulo operation: take the remainder of the
digest divided by the number of buckets.  A particularly poor approach
is to use a power-of-2 number of buckets so as to reduce the modulo
operation to simple masking, because masking discards all of the
entropy in the masked-out bits.  A _much_ better approach is to pick
the number of buckets to be a prime number -- the remainder of
division by a prime is influenced by _every_ bit in the digest.
(When I need to implement a dynamic hash table I always include a
statically initialized array of primes chosen to give me the growth
pattern I desire; see the sketch below.)
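
A minimal sketch of that arrangement follows.  The particular primes
are an assumption, chosen here only to show a roughly-doubling growth
pattern; any table of primes matching the growth you want would do:

  #include <stddef.h>
  #include <stdint.h>

  /* Step 2: bucket counts come from a statically initialized prime
   * table.  These primes roughly double at each step so the table
   * grows geometrically (illustrative values, not prescribed by the
   * post). */
  static const uint32_t bucket_primes[] = {
      53, 97, 193, 389, 769, 1543, 3079, 6151,
      12289, 24593, 49157, 98317, 196613, 393241
  };

  /* Remainder by a prime bucket count; every bit of the digest
   * influences the result, unlike masking to a power of 2. */
  static size_t bucket_index(uint32_t digest, size_t prime_slot)
  {
      return digest % bucket_primes[prime_slot];
  }

When the load factor gets too high, the table would move to the next
slot in bucket_primes and rehash every entry into the larger array.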

/john
