On Thu, Oct 15, 2009 at 11:14:52PM +0200, Aurelien Jarno wrote: > @@ -394,10 +395,7 @@ uint32_t HELPER(uxtb16)(uint32_t x) > > uint32_t HELPER(clz)(uint32_t x) > { > - int count; > - for (count = 32; x; count--) > - x >>= 1; > - return count; > + return clz32(x); > } > > int32_t HELPER(sdiv)(int32_t num, int32_t den)
Just a quick note that the implementation of clz, ctz and popcnt is still listed in the TCG TODO list. The last time I looked, I noticed that quite a few architectures have clz/ctz instructions: http://lkml.indiana.edu/hypermail/linux/kernel/0601.3/1683.html For those that don't, I think a combination the following two hacks at http://graphics.stanford.edu/~seander/bithacks.html could be used: 'Round up to the next highest power of 2' 'Counting bits set, in parallel' With this, it should be possible to implement clz and ctz without too many operations for both 32-bit and 64-bit integers, without requiring floats, lookup tables or branches. Of course, __builtin_clz() might well do a better job... BTW, it may be worth pointing out: B[4] = 0x0000ffff; B[3] = B[4] ^ (B[4] << 8) => 0x00ff00ff B[2] = B[3] ^ (B[3] << 4) => 0x0f0f0f0f B[1] = B[2] ^ (B[2] << 2) => 0x33333333 B[0] = B[1] ^ (B[1] << 1) => 0x55555555 In reality, I wonder if five separate loads would be quicker, though. Cheers, -- Stuart Brady