On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amon...@intel.com> wrote:
>
> This proposal showcases the speed-up provided to popcount feature when using
> AVX512 registers. The intent is to share the preliminary results with the
> community and get feedback for adding avx512 support for popcount.
>
> Revisiting the previous discussion/improvements around this feature, I have
> created a micro-benchmark based on the pg_popcount() in PostgreSQL's current
> implementations for x86_64 using the newer AVX512 intrinsics. Playing with
> this implementation has improved performance up to 46% on Intel's Sapphire
> Rapids platform on AWS. Such gains will benefit scenarios relying on popcount.

How does this compare on older CPUs, and in more mixed workloads? IIRC,
the use of AVX512 (which I believe includes this instruction) has
significant implications for core clock frequency while those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.

> My setup:
>
> Machine: AWS EC2 m7i - 16vcpu, 64gb RAM
> OS : Ubuntu 22.04
> GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl
> -march=native -O2".
>
> 1. I copied the pg_popcount() implementation into a new C/C++ project using
> cmake/make.
> a. Software only and
> b. SSE 64 bit version
> 2. I created an implementation using the following AVX512 intrinsics:
> a. _mm512_popcnt_epi64()
> b. _mm512_reduce_add_epi64()
> 3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes;
> repeatable with RNG seed [std::mt19937_64])

Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not used by postgres' own systems, but which could be asked
to cover bytestreams of >BLCKSZ), the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
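
P.S. To make the page-sized case concrete, a scalar baseline for such a
measurement might look like the sketch below. This is only an illustration
under stated assumptions: popcount_buf() is a hypothetical helper, not
PostgreSQL's pg_popcount(), and it relies on the GCC/Clang builtins
__builtin_popcountll() and __builtin_popcount(). It contains no AVX512 code;
the idea is that an AVX512 variant would be compared against something like
this on buffers of at most 8 KiB rather than 64 MiB and up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical scalar baseline (NOT PostgreSQL's pg_popcount()):
 * counts the set bits in buf[0..len-1], 64 bits at a time, with a
 * byte-wise loop for the tail. Assumes a GCC/Clang-compatible compiler.
 */
static uint64_t
popcount_buf(const unsigned char *buf, size_t len)
{
    uint64_t total = 0;

    /* Main loop: one 64-bit word per iteration. */
    while (len >= sizeof(uint64_t))
    {
        uint64_t word;

        /* memcpy avoids unaligned loads; compilers turn it into a plain load. */
        memcpy(&word, buf, sizeof(word));
        total += (uint64_t) __builtin_popcountll(word);
        buf += sizeof(word);
        len -= sizeof(word);
    }

    /* Tail: fewer than 8 bytes remain. */
    while (len-- > 0)
        total += (uint64_t) __builtin_popcount(*buf++);

    return total;
}
```

Timing this over many distinct 8 KiB buffers, interleaved with unrelated
work, would be closer to how the in-tree callers behave than a single pass
over a very large bit stream.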