On Tue, 9 Jan 2024 at 16:03, Peter Eisentraut <pe...@eisentraut.org> wrote:
> On 29.11.23 18:15, Nathan Bossart wrote:
> > Using the same benchmark as we did for the SSE2 linear searches in
> > XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:
> >
> >   writers    sse2    avx2     %
> >   256        1195    1188    -1
> >   512         928    1054    +14
> >   1024        633     716    +13
> >   2048        332     420    +27
> >   4096        162     203    +25
> >   8192        162     182    +12
>
> AFAICT, your patch merely provides an alternative AVX2 implementation
> for where currently SSE2 is supported, but it doesn't provide any new
> API calls or new functionality. One might naively expect that these are
> just two different ways to call the underlying primitives in the CPU, so
> these performance improvements are surprising to me. Or do the CPUs
> actually have completely separate machinery for SSE2 and AVX2, and just
> using the latter to do the same thing is faster?
The AVX2 implementation uses a wider vector register. On most current processors the throughput of the instructions in question is the same on 256-bit vectors as on 128-bit vectors. Basically, the chip has an AVX2's worth of machinery, and using SSE2 leaves half of it unused. Notable exceptions are the efficiency cores on recent Intel desktop CPUs and pre-Zen-2 AMD CPUs, where AVX2 instructions are internally split into two 128-bit wide operations.

For AVX-512 the picture is much more complicated. Some instructions run at half rate, some at full rate but not on all ALU ports, and some cause aggressive clock rate reduction on some microarchitectures. AVX-512 also adds mask registers and masked vector instructions that enable considerably simpler code in many cases. Interestingly, I have seen Clang make quite effective use of these masked instructions even when compiling AVX2 intrinsics, as long as the target platform is AVX-512 capable.

The vector-width-independent approach used in the patch is nice for simple cases because it avoids needing a separate implementation for each vector width. For more complicated cases where "horizontal" operations are needed it's going to be much less useful, but those cases can easily drop down to using intrinsics directly.
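To make the width-independent idea concrete, here is a rough sketch (not the patch's actual code; all names are made up for illustration) of a linear search whose loop body is written once against a `Vector32` type that resolves to a 128-bit SSE2 register or a 256-bit AVX2 register at compile time, with a scalar fallback for other targets:

```c
/*
 * Sketch of a vector-width-independent linear search: does `key` occur in
 * base[0..nelem)?  The vector type and lane count are fixed at compile time,
 * so the same search loop serves both SSE2 (4 lanes) and AVX2 (8 lanes).
 * Names (Vector32, vec_*) are hypothetical, chosen for this example only.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
typedef __m256i Vector32;
#define VEC_NLANES 8
static inline Vector32 vec_broadcast(uint32_t v) { return _mm256_set1_epi32((int) v); }
static inline Vector32 vec_load(const uint32_t *p) { return _mm256_loadu_si256((const __m256i *) p); }
static inline bool vec_any_eq(Vector32 a, Vector32 b)
{
    return _mm256_movemask_epi8(_mm256_cmpeq_epi32(a, b)) != 0;
}
#elif defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i Vector32;
#define VEC_NLANES 4
static inline Vector32 vec_broadcast(uint32_t v) { return _mm_set1_epi32((int) v); }
static inline Vector32 vec_load(const uint32_t *p) { return _mm_loadu_si128((const __m128i *) p); }
static inline bool vec_any_eq(Vector32 a, Vector32 b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi32(a, b)) != 0;
}
#endif

static bool
linearsearch_uint32(const uint32_t *base, size_t nelem, uint32_t key)
{
    size_t      i = 0;

#if defined(__AVX2__) || defined(__SSE2__)
    if (nelem >= VEC_NLANES)
    {
        Vector32    keys = vec_broadcast(key);

        /* full vectors */
        for (; i + VEC_NLANES <= nelem; i += VEC_NLANES)
            if (vec_any_eq(vec_load(&base[i]), keys))
                return true;

        /* one final, possibly overlapping, vector covering the tail */
        return vec_any_eq(vec_load(&base[nelem - VEC_NLANES]), keys);
    }
#endif

    /* scalar fallback for short arrays and non-x86 builds */
    for (; i < nelem; i++)
        if (base[i] == key)
            return true;
    return false;
}
```

Note how the search loop itself never mentions a width: bumping from SSE2 to AVX2 is purely a matter of which typedef and helpers get compiled in. A "horizontal" operation (say, summing all lanes) would break this abstraction, since the reduction sequence differs per width; that is the kind of case that would drop to raw intrinsics.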