On Thu, Oct 15, 2009 at 11:14:52PM +0200, Aurelien Jarno wrote:
> @@ -394,10 +395,7 @@ uint32_t HELPER(uxtb16)(uint32_t x)
>  
>  uint32_t HELPER(clz)(uint32_t x)
>  {
> -    int count;
> -    for (count = 32; x; count--)
> -        x >>= 1;
> -    return count;
> +    return clz32(x);
>  }
>  
>  int32_t HELPER(sdiv)(int32_t num, int32_t den)

Just a quick note that the implementation of clz, ctz and popcnt is
still listed in the TCG TODO list.  The last time I looked, I noticed
that quite a few architectures have clz/ctz instructions:

   http://lkml.indiana.edu/hypermail/linux/kernel/0601.3/1683.html

For those that don't, I think a combination the following two hacks at
http://graphics.stanford.edu/~seander/bithacks.html could be used:

   'Round up to the next highest power of 2'
   'Counting bits set, in parallel'

With this, it should be possible to implement clz and ctz without too
many operations for both 32-bit and 64-bit integers, without requiring
floats, lookup tables or branches.  Of course, __builtin_clz() might
well do a better job...

BTW, it may be worth pointing out:

   B[4] = 0x0000ffff;
   B[3] = B[4] ^ (B[4] << 8) => 0x00ff00ff
   B[2] = B[3] ^ (B[3] << 4) => 0x0f0f0f0f
   B[1] = B[2] ^ (B[2] << 2) => 0x33333333
   B[0] = B[1] ^ (B[1] << 1) => 0x55555555

In reality, I wonder if five separate loads would be quicker, though.

Cheers,
-- 
Stuart Brady


Reply via email to