Re: Question about the Implementation of vector32_is_highbit_set on ARM

John Naylor Mon, 20 Nov 2023 01:06:20 -0800

On Wed, Nov 8, 2023 at 2:44 PM Xiang Gao <xiang....@arm.com> wrote:
>  * function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e.
>  * check each 32-bit element, but that would require an additional mask
>  * operation on x86.
>  */


> But I still don't understand why the vmaxvq_u32 intrinsic is not used on the
> arm platform.

The current use case expects each 32-bit lane to be all 1's or all 0's.
If anyone tried using it for arbitrary values, vmaxvq_u32 could give a
different answer than _mm_movemask_epi8 does on x86, so I think that's
the origin of that comment. But it's still a maintenance hazard as is,
since the x86 path only checks the high bit of each byte and wouldn't
work for arbitrary values. It seems the path
forward is to rename this function to vector32_is_any_lane_set(), as
in the attached (untested on Arm). That would allow each
implementation to use the most efficient path, whether it's by 8- or
32-bit lanes. If we someday needed to look at only the high bits, we
would need a new function that performed the necessary masking on x86.
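
To make the difference concrete, here is a rough scalar model of the two
approaches. It is only a sketch: Vec32x4 and both function names are
made up for illustration and are not part of simd.h, and plain loops
stand in for the real intrinsics.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit vector holding four 32-bit lanes. */
typedef struct
{
	uint32_t	lane[4];
} Vec32x4;

/*
 * Scalar model of the x86 path: _mm_movemask_epi8() collects the high bit
 * of each of the 16 bytes, and the caller tests whether any was set.
 */
static bool
model_any_byte_highbit(Vec32x4 v)
{
	for (int i = 0; i < 4; i++)
	{
		/* 0x80808080 selects the high bit of each byte in the lane */
		if (v.lane[i] & UINT32_C(0x80808080))
			return true;
	}
	return false;
}

/*
 * Scalar model of the Arm path: vmaxvq_u32() returns the largest of the
 * four lanes, so "max > 0" means at least one lane is nonzero.
 */
static bool
model_any_lane_nonzero(Vec32x4 v)
{
	uint32_t	max = 0;

	for (int i = 0; i < 4; i++)
	{
		if (v.lane[i] > max)
			max = v.lane[i];
	}
	return max > 0;
}
```

With a comparison result such as {0, 0xFFFFFFFF, 0, 0} both models
return true, and with all zeros both return false. With an arbitrary
value such as {1, 0, 0, 0} the byte-wise model returns false while the
lane-wise model returns true, which is the disagreement described above.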

It's possible this method could shave cycles on Arm in some 8-bit lane
cases where we don't actually care about the high bit specifically,
since the movemask equivalent is slow on that platform, but I haven't
looked yet.

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 59aa8245ed..f536905d4d 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -151,7 +151,7 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 		result = vector32_or(tmp1, tmp2);
 
 		/* see if there was a match */
-		if (vector32_is_highbit_set(result))
+		if (vector32_is_any_lane_set(result))
 		{
 			Assert(assert_result == true);
 			return true;
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..40558bbca8 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -78,7 +78,7 @@ static inline bool vector8_has_zero(const Vector8 v);
 static inline bool vector8_has_le(const Vector8 v, const uint8 c);
 static inline bool vector8_is_highbit_set(const Vector8 v);
 #ifndef USE_NO_SIMD
-static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline bool vector32_is_any_lane_set(const Vector32 v);
 #endif
 
 /* arithmetic operations */
@@ -278,22 +278,21 @@ vector8_is_highbit_set(const Vector8 v)
 }
 
 /*
- * Exactly like vector8_is_highbit_set except for the input type, so it
- * looks at each byte separately.
+ * Return true if any 32-bit lane is set, otherwise false.
  *
- * XXX x86 uses the same underlying type for 8-bit, 16-bit, and 32-bit
- * integer elements, but Arm does not, hence the need for a separate
- * function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e.
- * check each 32-bit element, but that would require an additional mask
- * operation on x86.
+ * XXX: We assume all lanes are either all zeros or all ones.
  */
 #ifndef USE_NO_SIMD
 static inline bool
-vector32_is_highbit_set(const Vector32 v)
+vector32_is_any_lane_set(const Vector32 v)
 {
 #if defined(USE_NEON)
-	return vector8_is_highbit_set((Vector8) v);
+	return vmaxvq_u32(v) > 0;
 #else
+	/*
+	 * There is no 32-bit version of _mm_movemask_epi8, but we can still use
+	 * the 8-bit version.
+	 */
 	return vector8_is_highbit_set(v);
 #endif
 }
