On Mon, 1 Apr 2024 at 18:53, Nathan Bossart <nathandboss...@gmail.com> wrote:
>
> On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
> > On 2024-Mar-31, Nathan Bossart wrote:
> >> +	popcnt = _mm512_reduce_add_epi64(accum);
> >> +	return popcnt + pg_popcount_fast(buf, bytes);
> >
> > Hmm, doesn't this arrangement cause an extra function call to
> > pg_popcount_fast to be used here?  Given the level of micro-optimization
> > being used by this code, I would have thought that you'd have tried to
> > avoid that.  (At least, maybe avoid the call if bytes is 0, no?)
>
> Yes, it does.  I did another benchmark on very small arrays and can see the
> overhead.  This is the time in milliseconds to run pg_popcount() on an
> array 1 billion times:
>
>   size (bytes)    HEAD        AVX512-POPCNT
>   1               1707.685    3480.424
>   2               1926.694    4606.182
>   4               3210.412    5284.506
>   8               1920.703    3640.968
>   16              2936.91     4045.586
>   32              3627.956    5538.418
>   64              5347.213    3748.212
>
> I suspect that anything below 64 bytes will see this regression, as that is
> the earliest point where there are enough bytes for ZMM registers.
What about using the masking capabilities of AVX-512 to handle the tail
in the same code path? Masked-out portions of a load instruction will not
generate an exception. To allow byte-level granularity for the mask,
-mavx512bw is needed. Based on Wikipedia, requiring BW will only disable
this fast path on Knights Mill (Xeon Phi); in all other cases VPOPCNTQ
implies availability of BW.

Attached is an example of what I mean. I did not have a machine to test it
with, but the generated code looks sane. I added the clang pragma because
it insisted on unrolling otherwise, and based on how the instruction
dependencies look, unrolling is probably not too helpful even for large
inputs (needs to be tested). The configure check and compile flags of
course need to be amended for BW.

Regards,
Ants Aasma
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..7fb2ada16c9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,27 @@
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
-	uint64		popcnt;
+	__m512i		val, cnt;
+	__mmask64	remaining_mask;
 	__m512i		accum = _mm512_setzero_si512();
 
-	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+	#pragma clang loop unroll(disable)
+	for (; bytes > sizeof(__m512i); bytes -= sizeof(__m512i))
 	{
-		const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
-		const __m512i cnt = _mm512_popcnt_epi64(val);
+		val = _mm512_loadu_si512((const __m512i *) buf);
+		cnt = _mm512_popcnt_epi64(val);
 
 		accum = _mm512_add_epi64(accum, cnt);
 		buf += sizeof(__m512i);
 	}
 
-	popcnt = _mm512_reduce_add_epi64(accum);
-	return popcnt + pg_popcount_fast(buf, bytes);
+	remaining_mask = ~0ULL >> (sizeof(__m512i) - bytes);
+	val = _mm512_maskz_loadu_epi8(remaining_mask, (const __m512i *) buf);
+	cnt = _mm512_popcnt_epi64(val);
+
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
 }
 
 #endif							/* TRY_POPCNT_FAST */