While reviewing the patch to speed up Gist indexes with tsvectors [1] by smarter use of popcount, I was reminded that for hardware popcount, we do an indirect function call to execute a single CPU instruction, one word at a time. I went ahead and did some microbenchmarks to see how much that buys us, and if we could do better.
For the tests below I used the attached C file compiled into a shared library with gcc 8.4, measuring popcount on an 8kB buffer, median of 3 runs: select drive_popcount64(1000000, 1024); select drive_popcount(1000000, 1024); Note: pg_popcount64() is passed one 8-byte word, and pg_popcount() is passed a buffer, which if properly aligned allows to call the appropriate word-at-a-time popcount function. master -- indirect call to pg_popcount64_asm(), where available. Note that passing a buffer is faster: pg_popcount64() 2680ms pg_popcount() 1640ms 0001 ignores assembly and uses direct calls to popcount64_slow(). It is quite a bit slower, but notice that passing a buffer to pg_popcont() with the slow implementation is not all that much slower than calling the indirect assembly one word at a time (2680ms vs 3190ms): pg_popcount64() 4930ms pg_popcount() 3190ms It's also worth noting that currently all platforms use an indirect function call, regardless of instruction support, so the direct call here is a small win for non-x86-64 platforms. 0002 restores indirection through a function pointer, but chooses the pass-a-buffer function rather than the retail function, and if assembly is available, calls inlined popcount64_asm(). This in itself is not much faster than master (1640ms vs 1430ms): pg_popcount() 1430ms However, 0003 takes the advice of [2] and [3] and uses hand-written unrolled assembly, and now we actually have some real improvement, over 3x faster than master: pg_popcount() 494ms 0004 shows how it would look to use the buffer-passing version for bitmapsets and the visibility map. Bitmapsets would benefit from this even on master, going by the test results above. If we went with something like 0003, bitmapsets would benefit further automatically. However, counting of visibility map bits is a bit more tricky, since we have to mask off the visible bits and frozen bits to count them separately. 0004 has a simple unoptimized function to demonstrate. As I recall, the VM code did not benefit much from the popcount infrastructure to begin with, so a small regression might not be noticeable. If it is, it's just a SMOP to offer an assembly version here also. The motivation for benchmarking this was to have some concrete numbers in mind while reviewing gist index creation. If the gist patch can benefit from any of the above, it's worth considering, as a more holistic approach. Since it affects other parts of the system, I wanted to share this in a new thread first. [1] https://commitfest.postgresql.org/32/3023/ [2] https://danluu.com/assembly-intrinsics/ [3] https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviati -- John Naylor EDB: http://www.enterprisedb.com
v1-0001-Use-direct-function-calls-for-pg_popcount-32-64.patch
Description: Binary data
v1-0002-Use-platform-specific-implementations-of-pg_popco.patch
Description: Binary data
v1-0003-Manually-unroll-the-loop-with-hand-written-assemb.patch
Description: Binary data
v1-0004-Use-pg_popcount-rather-than-pg_popcount-32-64-whe.patch
Description: Binary data
drive_popcount.c
Description: Binary data