On Mon, 1 Apr 2024 at 18:53, Nathan Bossart <nathandboss...@gmail.com> wrote:
>
> On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
> > On 2024-Mar-31, Nathan Bossart wrote:
> >> +	popcnt = _mm512_reduce_add_epi64(accum);
> >> +	return popcnt + pg_popcount_fast(buf, bytes);
> >
> > Hmm, doesn't this arrangement cause an extra function call to
> > pg_popcount_fast to be used here?  Given the level of micro-optimization
> > being used by this code, I would have thought that you'd have tried to
> > avoid that.  (At least, maybe avoid the call if bytes is 0, no?)
>
> Yes, it does.  I did another benchmark on very small arrays and can see the
> overhead.  This is the time in milliseconds to run pg_popcount() on an
> array 1 billion times:
>
>   size (bytes)    HEAD        AVX512-POPCNT
>   1               1707.685    3480.424
>   2               1926.694    4606.182
>   4               3210.412    5284.506
>   8               1920.703    3640.968
>   16              2936.91     4045.586
>   32              3627.956    5538.418
>   64              5347.213    3748.212
>
> I suspect that anything below 64 bytes will see this regression, as that is
> the earliest point where there are enough bytes for ZMM registers.
What about using the masking capabilities of AVX-512 to handle the tail
in the same code path? Masked-out portions of a load instruction will not
generate an exception. To allow byte-level granularity for the mask,
-mavx512bw is needed. Based on Wikipedia, requiring BW will only disable
this fast path on Knights Mill (Xeon Phi); in all other cases VPOPCNTQ
implies availability of BW.

Attached is an example of what I mean. I did not have a machine to test it
with, but the generated code looks sane. I added the clang pragma because
it insisted on unrolling otherwise, and based on how the instruction
dependencies look, unrolling is probably not too helpful even for large
inputs (needs to be tested). The configure check and compile flags of
course need to be amended for BW.

Regards,
Ants Aasma
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..7fb2ada16c9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,27 @@
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
-	uint64		popcnt;
+	__m512i		val, cnt;
+	__mmask64	remaining_mask;
 	__m512i		accum = _mm512_setzero_si512();
 
-	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+	#pragma clang loop unroll(disable)
+	for (; bytes > sizeof(__m512i); bytes -= sizeof(__m512i))
 	{
-		const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
-		const __m512i cnt = _mm512_popcnt_epi64(val);
+		val = _mm512_loadu_si512((const __m512i *) buf);
+		cnt = _mm512_popcnt_epi64(val);
 
 		accum = _mm512_add_epi64(accum, cnt);
 		buf += sizeof(__m512i);
 	}
 
-	popcnt = _mm512_reduce_add_epi64(accum);
-	return popcnt + pg_popcount_fast(buf, bytes);
+	remaining_mask = ~0ULL >> (sizeof(__m512i) - bytes);
+	val = _mm512_maskz_loadu_epi8(remaining_mask, (const __m512i *) buf);
+	cnt = _mm512_popcnt_epi64(val);
+
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
 }
 
 #endif							/* TRY_POPCNT_FAST */