On Tue, 2 Apr 2024 at 00:31, Nathan Bossart <nathandboss...@gmail.com> wrote:
> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
> > What about using the masking capabilities of AVX-512 to handle the
> > tail in the same code path? Masked out portions of a load instruction
> > will not generate an exception. To allow byte level granularity
> > masking, -mavx512bw is needed. Based on wikipedia this will only
> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
> > VPOPCNTQ implies availability of BW.
>
> Sounds promising. IMHO we should really be sure that these kinds of loads
> won't generate segfaults and the like due to the masked-out portions. I
> searched around a little bit but haven't found anything that seemed
> definitive.
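For what it's worth, if we wanted to settle the fault question directly, a
standalone probe along the lines of the sketch below (hypothetical, not part
of the patch; Linux-only because of mmap/mprotect, build flags are just an
example) should show that the masked-out bytes of a byte-granular masked load
can even sit in a PROT_NONE page without faulting. As far as I can tell the
SDM describes this as fault suppression for masked-off elements; if that
holds, the probe prints 64 and exits cleanly.

    /*
     * Hypothetical probe (not part of the patch): place 8 readable bytes
     * right before a PROT_NONE page and do a 64-byte masked load over the
     * boundary.  Build with something like:
     *     gcc -O2 -mavx512bw -mavx512vpopcntdq probe.c
     */
    #include <immintrin.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t      pagesz = (size_t) sysconf(_SC_PAGESIZE);
        char       *p;
        __mmask64   mask;
        __m512i     val;
        long long   cnt;

        /* two adjacent pages, the second one made inaccessible */
        p = mmap(NULL, 2 * pagesz, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED || mprotect(p + pagesz, pagesz, PROT_NONE) != 0)
            return 1;

        /* 8 readable 0xff bytes immediately before the protected page */
        memset(p + pagesz - 8, 0xff, 8);

        /*
         * 64-byte load starting at those 8 bytes: only they are unmasked,
         * the remaining 56 bytes of the load fall into the PROT_NONE page
         * and are suppressed by the mask.
         */
        mask = (1ULL << 8) - 1;
        val = _mm512_maskz_loadu_epi8(mask, p + pagesz - 8);
        cnt = _mm512_reduce_add_epi64(_mm512_popcnt_epi64(val));

        printf("popcount = %lld (expected 64), no fault\n", cnt);
        return 0;
    }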
After sleeping on the problem, I think we can avoid this question altogether,
and make the code faster at the same time, by using aligned accesses. Loads
that straddle cache line boundaries run internally as two load operations.
Gut feel says that there are enough out-of-order resources available for that
not to matter in most cases, but even so, not doing the extra work is surely
better.

Attached is another approach that uses aligned accesses. Every load stays
within a 64-byte block that the buffer itself overlaps, so it can never stray
into an unmapped page. It would be interesting to see how it fares for small
inputs: anything that fits into one aligned cache line should take constant
time and there is only one branch, but the mask setup and folding the
separate popcounts together should add up to roughly 20 cycles of overhead.

Regards,
Ants Aasma
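PS. To make the overhead claim a bit more concrete, here is a toy sketch (not
part of the patch; the pointer value and length are made up) of how the
first-iteration and last-iteration masks from the patch combine when the
whole input fits into one aligned 64-byte block, so a single masked load
covers exactly the requested bytes:

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* made-up example: 20 bytes starting at offset 5 of a 64-byte block */
        uintptr_t   buf = 0x1000 + 5;
        int         bytes = 20;

        /* first-iteration mask: drop the bytes before buf */
        uint64_t    head = ~UINT64_C(0) << (buf % 64);

        /* last-iteration mask: drop the bytes after buf + bytes - 1 */
        int         tail_idx = (int) ((buf + bytes - 1) % 64) + 1;
        uint64_t    tail = ~UINT64_C(0) >> (64 - tail_idx);

        /*
         * Start and end fall into the same 64-byte block, so the loop body
         * never runs and the final load uses head & tail: exactly bits 5..24
         * are set, i.e. the 20 requested bytes.
         */
        printf("head=%016llx tail=%016llx combined=%016llx\n",
               (unsigned long long) head,
               (unsigned long long) tail,
               (unsigned long long) (head & tail));
        return 0;
    }

That mask arithmetic plus the final _mm512_reduce_add_epi64 is where the
fixed overhead for tiny inputs goes.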
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..e1fbd98fa14 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,44 @@
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
-	uint64		popcnt;
+	__m512i		val, cnt;
 	__m512i		accum = _mm512_setzero_si512();
+	const char *final;
+	int			tail_idx;
+	__mmask64	mask = -1;
 
-	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
-	{
-		const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
-		const __m512i cnt = _mm512_popcnt_epi64(val);
+	/*
+	 * Align buffer down to avoid double load overhead from unaligned access.
+	 * Calculate a mask to ignore preceding bytes. Find start offset of final
+	 * iteration and number of valid bytes making sure that final iteration
+	 * is not empty.
+	 */
+	mask <<= ((uintptr_t) buf) % sizeof(__m512i);
+	tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+	final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+	buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+
+	/*
+	 * Iterate through all but the final iteration. Starting from second
+	 * iteration, the start index mask is ignored.
+	 */
+	for (; buf < final; buf += sizeof(__m512i))
+	{
+		val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+		cnt = _mm512_popcnt_epi64(val);
 		accum = _mm512_add_epi64(accum, cnt);
-		buf += sizeof(__m512i);
+
+		mask = -1;
 	}
 
-	popcnt = _mm512_reduce_add_epi64(accum);
-	return popcnt + pg_popcount_fast(buf, bytes);
+	/* Final iteration needs to ignore bytes that are not within the length */
+	mask &= ((~0ULL) >> (64 - tail_idx));
+
+	val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+	cnt = _mm512_popcnt_epi64(val);
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
 }
 
 #endif							/* TRY_POPCNT_FAST */
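In case it helps review, below is a rough standalone harness (not part of the
patch; the file name, build flags and the local TYPEALIGN_DOWN definition are
stand-ins) that re-implements the same aligned/masked loop and compares it
against a plain byte-wise popcount for every start offset within a 64-byte
block and a range of lengths:

    /*
     * Hypothetical sanity check (check_popcount.c), build with e.g.:
     *     gcc -O2 -mavx512bw -mavx512vpopcntdq check_popcount.c
     */
    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* local stand-in for the c.h macro used by the patch */
    #define TYPEALIGN_DOWN(ALIGNVAL, LEN) \
        (((uintptr_t) (LEN)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

    /* same logic as the attached patch, transcribed for standalone testing */
    static uint64_t
    popcount_avx512_masked(const char *buf, int bytes)
    {
        __m512i     val, cnt;
        __m512i     accum = _mm512_setzero_si512();
        const char *final;
        int         tail_idx;
        __mmask64   mask = ~UINT64_C(0);

        mask <<= ((uintptr_t) buf) % sizeof(__m512i);
        tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
        final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
        buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);

        for (; buf < final; buf += sizeof(__m512i))
        {
            val = _mm512_maskz_loadu_epi8(mask, buf);
            cnt = _mm512_popcnt_epi64(val);
            accum = _mm512_add_epi64(accum, cnt);
            mask = ~UINT64_C(0);
        }

        mask &= ~UINT64_C(0) >> (64 - tail_idx);
        val = _mm512_maskz_loadu_epi8(mask, buf);
        cnt = _mm512_popcnt_epi64(val);
        accum = _mm512_add_epi64(accum, cnt);

        return _mm512_reduce_add_epi64(accum);
    }

    /* reference: one byte at a time */
    static uint64_t
    popcount_scalar(const char *buf, int bytes)
    {
        uint64_t    r = 0;

        for (int i = 0; i < bytes; i++)
            r += __builtin_popcount((unsigned char) buf[i]);
        return r;
    }

    int
    main(void)
    {
        static _Alignas(64) char storage[4096];

        srand(42);
        for (size_t i = 0; i < sizeof(storage); i++)
            storage[i] = (char) rand();

        /* exercise every start offset within a block and many lengths */
        for (int off = 0; off < 64; off++)
            for (int len = 1; len <= 512; len++)
                if (popcount_avx512_masked(storage + off, len) !=
                    popcount_scalar(storage + off, len))
                {
                    printf("mismatch at off=%d len=%d\n", off, len);
                    return 1;
                }
        printf("all offsets/lengths agree\n");
        return 0;
    }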