our use of popcount

John Naylor Wed, 03 Mar 2021 08:41:40 -0800

While reviewing the patch to speed up Gist indexes with tsvectors [1] by
smarter use of popcount, I was reminded that for hardware popcount, we do
an indirect function call to execute a single CPU instruction, one word at
a time. I went ahead and did some microbenchmarks to see how much that buys
us, and if we could do better.


For the tests below I used the attached C file compiled into a shared
library with gcc 8.4, measuring popcount on an 8kB buffer, median of 3 runs:

select drive_popcount64(1000000, 1024);
select drive_popcount(1000000, 1024);

Note: pg_popcount64() is passed one 8-byte word, and pg_popcount() is
passed a buffer, which if properly aligned allows to call the appropriate
word-at-a-time popcount function.


master -- indirect call to pg_popcount64_asm(), where available. Note
that passing a buffer is faster:

pg_popcount64()
2680ms

pg_popcount()
1640ms


0001 ignores assembly and uses direct calls to popcount64_slow(). It is
quite a bit slower, but notice that passing a buffer to pg_popcont() with
the slow implementation is not all that much slower than calling the
indirect assembly one word at a time (2680ms vs 3190ms):

pg_popcount64()
4930ms

pg_popcount()
3190ms

It's also worth noting that currently all platforms use an indirect
function call, regardless of instruction support, so the direct call here
is a small win for non-x86-64 platforms.


0002 restores indirection through a function pointer, but chooses the
pass-a-buffer function rather than the retail function, and if assembly is
available, calls inlined popcount64_asm(). This in itself is not much
faster than master (1640ms vs 1430ms):

pg_popcount()
1430ms


However, 0003 takes the advice of [2] and [3] and uses hand-written
unrolled assembly, and now we actually have some real improvement, over 3x
faster than master:

pg_popcount()
494ms

0004 shows how it would look to use the buffer-passing version for
bitmapsets and the visibility map. Bitmapsets would benefit from this even
on master, going by the test results above. If we went with something like
0003, bitmapsets would benefit further automatically. However, counting of
visibility map bits is a bit more tricky, since we have to mask off the
visible bits and frozen bits to count them separately. 0004 has a simple
unoptimized function to demonstrate. As I recall, the VM code did not
benefit much from the popcount infrastructure to begin with, so a small
regression might not be noticeable. If it is, it's just a SMOP to offer an
assembly version here also.

The motivation for benchmarking this was to have some concrete numbers in
mind while reviewing gist index creation. If the gist patch can benefit
from any of the above, it's worth considering, as a more holistic approach.
Since it affects other parts of the system, I wanted to share this in a new
thread first.

[1] https://commitfest.postgresql.org/32/3023/
[2] https://danluu.com/assembly-intrinsics/
[3]
https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviati

-- 
John Naylor
EDB: http://www.enterprisedb.com

v1-0001-Use-direct-function-calls-for-pg_popcount-32-64.patch
Description: Binary data

v1-0002-Use-platform-specific-implementations-of-pg_popco.patch
Description: Binary data

v1-0003-Manually-unroll-the-loop-with-hand-written-assemb.patch
Description: Binary data

v1-0004-Use-pg_popcount-rather-than-pg_popcount-32-64-whe.patch
Description: Binary data

drive_popcount.c
Description: Binary data

our use of popcount

Reply via email to