Hi all,

In commit ab5b4e2f9ed, we optimized AllocSetFreeIndex() using a lookup table. At the time, using CLZ was rejected because compiler/platform support was not widespread enough to justify it. For other reasons, we recently added pg_bitutils.h, which uses __builtin_clz() where available, so it makes sense to revisit this.

I modified the test in [1] (C files attached), using two separate functions to test CLZ versus the open-coded algorithm of pg_leftmost_one_pos32().
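For anyone reading without the attachments, here is a minimal sketch of the two approaches being compared. It is not taken from the attached test files. The function names free_index_table(), free_index_clz() and check_variants_agree() are made up for illustration, the table mirrors the style of the existing bespoke lookup table, ALLOC_MINBITS is assumed to be 3, and __builtin_clz() is assumed to be available (GCC or Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALLOC_MINBITS 3			/* assumed: smallest chunk is 8 bytes */

/* Table of floor(log2(n)) + 1 for n in 1..255, and 0 for n == 0. */
#define LT16(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
static const unsigned char LogTable256[256] =
{
	0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
	LT16(5), LT16(6), LT16(6), LT16(7), LT16(7), LT16(7), LT16(7),
	LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8)
};

/* Lookup-table variant (hypothetical name); valid while tsize < 65536. */
static int
free_index_table(size_t size)
{
	int			idx = 0;

	if (size > ((size_t) 1 << ALLOC_MINBITS))
	{
		uint32_t	tsize = (uint32_t) ((size - 1) >> ALLOC_MINBITS);
		uint32_t	t = tsize >> 8;

		idx = t ? LogTable256[t] + 8 : LogTable256[tsize];
	}
	return idx;
}

/* CLZ variant (hypothetical name); assumes a 32-bit unsigned int. */
static int
free_index_clz(size_t size)
{
	int			idx = 0;

	/* size - 1 is nonzero here, so __builtin_clz() is well defined */
	if (size > ((size_t) 1 << ALLOC_MINBITS))
		idx = 31 - __builtin_clz((uint32_t) size - 1) - ALLOC_MINBITS + 1;
	return idx;
}

/* Sanity check: both variants must agree for every size up to max_size. */
static void
check_variants_agree(size_t max_size)
{
	for (size_t size = 1; size <= max_size; size++)
		assert(free_index_table(size) == free_index_clz(size));
}
```

Calling check_variants_agree(8192) exercises every request size up to 8kB, which I am assuming here is the largest size that goes through the freelists.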
These are typical results on a recent Intel platform:

HEAD        5.55s
clz         4.51s
open-coded  9.67s

CLZ gives a nearly 20% speed boost on this microbenchmark. I suspect that this microbenchmark is actually biased towards the lookup table more than real-world workloads are, because it lets the table monopolize the L1 cache. Sparing cache is possibly the more interesting reason to use CLZ.

The open-coded version is nearly twice as slow, so it makes sense to keep the current implementation as the default one, and not use pg_leftmost_one_pos32() directly. However, with a small tweak, we can reuse the lookup table in pg_bitutils.c instead of the bespoke one used solely by AllocSetFreeIndex(), saving a couple of cache lines here as well. This is done in the attached patch.

[1] https://www.postgresql.org/message-id/407d949e0907201811i13c73e18x58295566d27aadcc%40mail.gmail.com

-- 
John Naylor                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:
  0001-Use-the-CLZ-instruction-in-AllocSetFreeIndex.patch
  test_allocsetfreeindex.c
  test_allocsetfreeindex2.c