Hi,

David R and I were discussing vectorisation and microarchitectures and what you can expect the target microarchitecture to be these days, and it seemed like some of our choices are not really very forward-looking.

Distros targeting x86-64 traditionally assumed the original AMD64 K8 instruction set, so if we want to use newer instructions we use various configure or runtime checks to see if that's safe. Recent GCC and Clang versions understand -march=x86-64-v{2,3,4}[1]. RHEL9 and similar and SUSE Tumbleweed now require x86-64-v2, and IIUC they changed the -march default to -v2 in their build of GCC, and I think Ubuntu has something in the works, perhaps for -v3[2].

Some of our current tricks won't take proper advantage of that: we'll still access POPCNT through a function pointer!

I was wondering how to do it. One idea that seems kinda weird is to try $(CC) -S test_builtin_popcount.c, and then grep for POPCNT in test_builtin_popcount.s! I assume we don't want to use __builtin_popcount() if it doesn't generate the instruction (using the compiler flags provided, or the defaults otherwise), because on a more conservative distro we'd get GCC/Clang's fallback code instead of our own runtime-checked POPCNT-instruction-through-a-function-pointer. (Presumably we believed that to be better.) Being able to use __builtin_popcount() directly without any function pointer nonsense is obviously faster, and it's also automatically vectorisable. That's not like the CRC32 instruction checks we have, because those either work or don't work with default compiler flags, whereas for POPCNT the builtin always works but might generate fallback code instead of the desired instruction, so you have to inspect what it generates.

FWIW Windows 11 on x86 requires the POPCNT instruction to boot. Windows 10 EOL is October next year, so we can use MSVC's intrinsic without a function pointer if we just wait :-)

All 64-bit ARM systems have CNT, but we don't use it! Likewise for all modern POWER (8+) and SPARC chips that any OS can actually run on these days. For RISC-V it's part of the bit manipulation option, but we're already relying on that by detecting and using other pg_bitutils.h builtins.

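To make the "compile it and look at the assembly" idea above concrete, here is a minimal sketch of what such a probe file might contain. The function name is made up for illustration, and the commands in the comment are only one assumed way a configure step could drive it; the real flags would be whatever the build passes to the compiler.

```c
/*
 * test_builtin_popcount.c: hypothetical configure probe.
 *
 * Assumed usage (illustrative only, not an existing build rule):
 *
 *     $(CC) $(CFLAGS) -O2 -S test_builtin_popcount.c -o test_builtin_popcount.s
 *     grep -qi popcnt test_builtin_popcount.s
 *
 * If grep succeeds, the builtin expanded to the POPCNT instruction under
 * the given (or default) flags.  If it fails, the compiler emitted its own
 * fallback code or a libgcc call such as __popcountdi2 instead.
 */
#include <stdint.h>

/* Hypothetical name, chosen so that the symbol itself never matches "popcnt". */
int
probe_bitcount64(uint64_t x)
{
    /* Always compiles on GCC/Clang; only the generated code differs. */
    return __builtin_popcountll(x);
}
```

The function is deliberately non-static and takes its argument at run time, so the compiler can neither discard it nor fold the result to a constant before the assembly is inspected.
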
So I think we should probably just use the builtin directly everywhere, except on x86, where we should either check if it generates the instruction we want, OR, if we can determine that the modern GCC/Clang fallback code is actually faster than our function pointer hop, then maybe we should just always use it even there, after checking that it exists.

[1] https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels
[2] https://ubuntu.com/blog/optimising-ubuntu-performance-on-amd64-architecture
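
PS: a rough sketch of the selection logic proposed above, under one assumption that is not part of the proposal itself: instead of grepping assembly, it keys off the __POPCNT__ macro, which GCC and Clang predefine when the target flags enable POPCNT (for example -mpopcnt or -march=x86-64-v2). All function names here are hypothetical, and the portable branch merely stands in for whatever we'd keep on baseline x86-64 builds, such as today's runtime-checked function pointer.

```c
#include <stdint.h>

/* Hypothetical stand-in for the non-builtin path: a plain SWAR bit count. */
static inline int
sketch_popcount64_portable(uint64_t x)
{
    x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
    x = (x & UINT64_C(0x3333333333333333)) +
        ((x >> 2) & UINT64_C(0x3333333333333333));
    x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
    /* Multiply sums the per-byte counts into the top byte. */
    return (int) ((x * UINT64_C(0x0101010101010101)) >> 56);
}

static inline int
sketch_popcount64(uint64_t x)
{
#if defined(__x86_64__) && !defined(__POPCNT__)
    /*
     * Baseline (K8-level) x86-64 target: the builtin would not become a
     * POPCNT instruction, so keep a separate path here.
     */
    return sketch_popcount64_portable(x);
#elif defined(__GNUC__)
    /* Non-x86, or x86-64-v2 and later: use the builtin directly. */
    return __builtin_popcountll(x);
#else
    return sketch_popcount64_portable(x);
#endif
}
```

Because the choice is made at compile time, callers see a plain inline function that the compiler is free to vectorise on targets where the builtin is used.
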