David, You are correct on many counts. Our divergent domains of development definitely colors our biases. I work with high end out of order processors.
The Intel Core i7 has a multiply latency of 3 cycles and can issue a new multiply into the pipeline every cycle. It has a worst case integer divide of 26 cycles. This is obviously less than the minimum of 32 cycles if it were developing one quotient bit per cycle. The 26 cycles reflects a moderate fixed startup, a max of 16 double-clocked cycles to carry out the division and a small fixed overhead to deliver the final result. During start up the operands pass through normal integer data path elements: - OR numerator and denominator and count resulting leading zeros - compute loop count: 32 - # leading zeros I am less familiar with the Atom's micro-architecture. A bit of Googling suggests that it has a partially pipelined multiplier with 5 cycle latency and confirms your claim of 50 cycles (worst case?) for divide. Clearly this is not double clocked. I could not find any indication of whether or not the divider incorporates early out preprocessing. > If your hash function has appropriate avalanching and funneling so > that you believe it is a good approximation to uniform, then there's > no reason to believe that the entropy obtained by a using 2^n bits > should be any better or worse than reducing modulo a prime. Digest functions that exhibit good avalanching and funneling are not cheap. True they need not be anywhere near as expensive as cryptographic hashes. But typically they will swamp the savings achieve by avoiding a divide. Again, our differing domains of development probably come out. For the past decade I have worked on a very high performance database execution engine. Our workhorse join operator is a hash join. Here I need an architecture that performs well irrespective of join column datatypes. Joining strings columns I might possibly be able to justify a high quality digest function with good avalanching and funneling but I surely cannot do so when joining 32-bit or 64-bit integer columns (our bread and butter). In the case of a 32-bit integer my digest is the identity function. In the case of a 64-bit integer I simply xor the 2 32-bit halves. Under such conditions a power of 2 performs very poorly. /john