Changing default -march landscape

Thomas Munro Wed, 12 Jun 2024 16:12:53 -0700

Hi,

David R and I were discussing vectorisation and microarchitectures and
what you can expect the target microarchitecture to be these days, and
it seemed like some of our choices are not really very
forward-looking.


Distros targeting x86-64 traditionally assumed the original AMD64 K8
instruction set, so if we want to use newer instructions we use
various configure or runtime checks to see if that's safe.

Recent GCC and Clang versions understand -march=x86-64-v{2,3,4}[1].
RHEL 9 and similar, and SUSE Tumbleweed, now require x86-64-v2, and IIUC
they changed the -march default to -v2 in their build of GCC, and I
think Ubuntu has something in the works perhaps for -v3[2].

Some of our current tricks won't take proper advantage of that:
we'll still access POPCNT through a function pointer!  I was wondering
how to do it.  One idea that seems kinda weird is to try $(CC) -S
test_builtin_popcount.c, and then grepping for POPCNT in
test_builtin_popcount.s!  I assume we don't want to use
__builtin_popcount() if it doesn't generate the instruction (using the
compiler flags provided or default otherwise), because on a more
conservative distro we'll use GCC/Clang's fallback code, instead of
our own runtime-checked POPCNT-instruction-through-a-function-pointer.
(Presumably we believed that to be better.)  Being able to use
__builtin_popcount() directly without any function pointer nonsense is
obviously faster, but also automatically vectorisable.
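
To make the function pointer hop concrete, here is a minimal sketch of
that style of dispatch.  It is not the code in our tree: all names are
hypothetical, and it assumes GCC or Clang, whose
__builtin_cpu_supports() and target attribute it relies on when
building for x86-64.

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
static int
popcount64_slow(uint64_t x)
{
    int n = 0;

    while (x != 0)
    {
        x &= x - 1;
        n++;
    }
    return n;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define SKETCH_HAVE_X86_RUNTIME_CHECK 1
/* POPCNT is enabled for this one function only, whatever -march says. */
static __attribute__((target("popcnt"))) int
popcount64_fast(uint64_t x)
{
    return __builtin_popcountll(x);
}
#endif

static int popcount64_choose(uint64_t x);

/* Every caller goes through this pointer; it starts at the chooser. */
static int (*popcount64_ptr) (uint64_t) = popcount64_choose;

/* First call: pick an implementation, remember it, then forward. */
static int
popcount64_choose(uint64_t x)
{
#ifdef SKETCH_HAVE_X86_RUNTIME_CHECK
    if (__builtin_cpu_supports("popcnt"))
        popcount64_ptr = popcount64_fast;
    else
#endif
        popcount64_ptr = popcount64_slow;
    return popcount64_ptr(x);
}

/* Hypothetical public entry point: always an indirect call. */
int
demo_popcount64(uint64_t x)
{
    return popcount64_ptr(x);
}
```

The indirect call in demo_popcount64() is the part the compiler cannot
inline or vectorise, even when the build already targets a CPU that is
guaranteed to have POPCNT.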

That's not like the CRC32 instruction checks we have, because those
either work or don't work with default compiler flags.  For POPCNT the
builtin always works, but it might generate fallback code instead of
the desired instruction, so you have to inspect what it generates.
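
A sketch of what such a probe file could look like follows.  The file
name comes from the idea above; the function name is made up, and is
chosen so that it does not itself contain the string "popcnt", which
would confuse the grep.

```c
/* test_builtin_popcount.c: hypothetical configure-time probe */
#include <stdint.h>

/*
 * The argument comes from the caller, so the compiler cannot fold the
 * builtin into a constant; it has to emit either the POPCNT
 * instruction or its fallback code.
 */
int
probe_popcount64(uint64_t x)
{
    return __builtin_popcountll(x);
}
```

The check would then be something along the lines of
"$(CC) $(CFLAGS) -S test_builtin_popcount.c" followed by
"grep -i popcnt test_builtin_popcount.s", run with exactly the flags
the real build will use.  Whether that is robust across compilers and
assembler syntaxes is an open question.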

FWIW Windows 11 on x86 requires the POPCNT instruction to boot.
Windows 10 EOL is October next year so we can use MSVC's intrinsic
without a function pointer if we just wait :-)

All 64-bit ARM systems have CNT, but we don't use it!  Likewise for all
modern POWER (8+) and SPARC chips that any OS can actually run on
these days.  For RISC-V it's part of the bit manipulation option, but
we're already relying on that by detecting and using other
pg_bitutils.h builtins.

So I think we should probably just use the builtin directly
everywhere, except on x86 where we should either check if it generates
the instruction we want, OR, if we can determine that the modern
GCC/Clang fallback code is actually faster than our function pointer
hop, then maybe we should just always use it even there, after
checking that it exists.
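
Roughly, the selection logic could look like the sketch below.  As an
alternative to grepping assembler output, it assumes the __POPCNT__
macro that GCC and Clang predefine when the instruction is enabled by
the compiler flags (for example via -mpopcnt or -march=x86-64-v2).
That is my assumption, not something we do today, and the SKETCH_ and
sketch_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Only a baseline x86-64 build still needs the runtime check. */
#if defined(__x86_64__) && !defined(__POPCNT__)
#define SKETCH_POPCNT_NEEDS_RUNTIME_CHECK 1
#else
#define SKETCH_POPCNT_NEEDS_RUNTIME_CHECK 0
#endif

int
sketch_popcount64(uint64_t x)
{
#if SKETCH_POPCNT_NEEDS_RUNTIME_CHECK
    /* Stand-in for the function pointer path: a plain bit loop. */
    int n = 0;

    while (x != 0)
    {
        x &= x - 1;
        n++;
    }
    return n;
#else
    /* Everywhere else, use the builtin directly. */
    return __builtin_popcountll(x);
#endif
}

/* A simple loop the compiler is free to inline and vectorise. */
uint64_t
sketch_popcount_buffer(const uint64_t *words, size_t nwords)
{
    uint64_t total = 0;

    for (size_t i = 0; i < nwords; i++)
        total += (uint64_t) sketch_popcount64(words[i]);
    return total;
}
```

With -march=x86-64-v2 or later, or on a non-x86 target, the direct
branch is taken and the loop in sketch_popcount_buffer() contains no
indirect call.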

[1] https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels
[2] https://ubuntu.com/blog/optimising-ubuntu-performance-on-amd64-architecture
