On Mon, 24 Feb 2025 14:27:03 -0500
Yury Norov <yury.no...@gmail.com> wrote:
....
> +#define parity(val)                                  \
> +({                                                   \
> +     u64 __v = (val);                                \
> +     int __ret;                                      \
> +     switch (BITS_PER_TYPE(val)) {                   \
> +     case 64:                                        \
> +             __v ^= __v >> 32;                       \
> +             fallthrough;                            \
> +     case 32:                                        \
> +             __v ^= __v >> 16;                       \
> +             fallthrough;                            \
> +     case 16:                                        \
> +             __v ^= __v >> 8;                        \
> +             fallthrough;                            \
> +     case 8:                                         \
> +             __v ^= __v >> 4;                        \
> +             __ret =  (0x6996 >> (__v & 0xf)) & 1;   \
> +             break;                                  \
> +     default:                                        \
> +             BUILD_BUG();                            \
> +     }                                               \
> +     __ret;                                          \
> +})
> +

You really don't want to do that!
gcc makes a right hash of it for x86 (32bit).
See https://www.godbolt.org/z/jG8dv3cvs

You do better switching to a 32bit variable (a __v32) after the 64bit xor.
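
A minimal sketch of that suggestion, written as a plain C function rather
than the macro from the patch. The name parity64_narrow and the use of
<stdint.h> types are assumptions for illustration only. The 64bit value is
folded once, and the remaining steps run on a 32bit variable, so a 32bit
target does not have to carry a register pair through the rest of the chain:

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: fold the two 32bit halves
 * together once, then finish the reduction in a 32bit variable.
 */
static inline int parity64_narrow(uint64_t val)
{
	/* Single 64bit step: xor the high half into the low half. */
	uint32_t v32 = (uint32_t)(val ^ (val >> 32));

	/* Everything from here on is 32bit arithmetic. */
	v32 ^= v32 >> 16;
	v32 ^= v32 >> 8;
	v32 ^= v32 >> 4;

	/* Bit i of 0x6996 is the parity of the 4bit value i. */
	return (0x6996 >> (v32 & 0xf)) & 1;
}
```

Whether this actually improves the generated code depends on the compiler
and its options, so it is worth checking the output rather than assuming.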

Even the 64bit version is probably sub-optimal (both gcc and clang).
The whole lot ends up being a big single register dependency chain.
You want to do:
        mov %eax, %edx
        shrl $n, %eax
        xor %edx, %eax
so that the 'mov' and 'shrl' can happen in the same clock
(without relying on the register-register move being optimised out).

I dropped in the arm64 output as an example of where the magic shift of
0x6996 just adds an extra instruction.
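
For comparison, a sketch of one way to finish the reduction without the
0x6996 lookup: keep folding with two more shift/xor steps and take the low
bit. This is an assumption about what an alternative could look like, not
something spelled out in this mail, and the name parity32_fold is
hypothetical. Which form is cheaper on a given target needs to be checked
against the compiler output:

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: reduce all the way down to
 * one bit with shift/xor steps instead of indexing into 0x6996.
 */
static inline int parity32_fold(uint32_t v)
{
	v ^= v >> 16;
	v ^= v >> 8;
	v ^= v >> 4;
	v ^= v >> 2;	/* replaces the 0x6996 table lookup ... */
	v ^= v >> 1;	/* ... with two further folds */

	/* The low bit now holds the xor of all 32 input bits. */
	return v & 1;
}
```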

        David

