On Thu, Sep 25, 2025 at 4:40 AM Nathan Bossart <nathandboss...@gmail.com> wrote:
>
> On Wed, Sep 24, 2025 at 10:59:38AM +0700, John Naylor wrote:
> > + if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
> > + break;
> >
> > But if you really want to do something here, sprinkling "(un)likely"'s
> > here seems like solving the wrong problem (even if they make any
> > difference), since the early return is optimizing for exceptional
> > conditions. In other places (cf. the UTF8 string verifier), we
> > accumulate errors, and only if we have them at the end do we restart
> > from the beginning with the slow error-checking path that can show the
> > user the offending input.
>
> I switched to an accumulator approach in v11.

Looks good to me.
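For the archives, the accumulator pattern we're talking about looks roughly like this in scalar form (a simplified sketch with hypothetical helper names, not the patch's actual SIMD code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * hex_value() is a stand-in for the real per-byte lookup; returns -1 on
 * a byte that is not a hex digit.
 */
static int
hex_value(uint8_t c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Instead of branching out of the fast loop on the first bad input,
 * accumulate validity across the whole chunk and let the caller restart
 * with the slow, error-reporting path only if something was invalid.
 * Returns the number of bytes decoded, or 0 on invalid input.
 */
static size_t
hex_decode_fast(const char *src, size_t len, uint8_t *dst)
{
	bool		success = true;
	size_t		i;

	for (i = 0; i + 1 < len; i += 2)
	{
		int			hi = hex_value((uint8_t) src[i]);
		int			lo = hex_value((uint8_t) src[i + 1]);

		/* accumulate validity instead of taking an early exit */
		success &= (hi >= 0 && lo >= 0);
		dst[i / 2] = (uint8_t) (((hi & 0xf) << 4) | (lo & 0xf));
	}

	return success ? len / 2 : 0;
}
```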

+ if (unlikely(!success))
+ i = 0;

This is after the main loop exits, and the cold path is literally one
instruction, so the motivation is not apparent to me.

> > The nibble/byte things are rather specific. Wouldn't it be more
> > logical to expose the already-generic shift operations and let the
> > caller say by how much? Or does the compiler refuse because the
> > intrinsic doesn't get an immediate value? Some are like that, but I'm
> > not sure about these. If so, that's annoying and I wonder if there's a
> > workaround.
>
> Yeah, the compiler refuses unless the value is an integer literal.  I
> thought of using a switch statement to cover all the values used in-tree,
> but I didn't like that, either.

Neither option is great, but I mildly lean towards keeping it internal
with "switch" or whatever: By putting the burden of specifying shift
amounts on separately named functions we run a risk of combinatorial
explosion in function names.
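In case it helps, the "switch" workaround I have in mind would look something like this (an SSE2-only sketch on 16-bit lanes, with a made-up function name; the NEON arm would need the same treatment):

```c
#include <emmintrin.h>
#include <stdint.h>

/*
 * Sketch of a generic shift hiding the immediate-operand restriction
 * behind a switch: each case hands the intrinsic the integer literal it
 * demands, so callers get one function name for all supported amounts.
 * Unsupported amounts fall through to a zero vector here; the real thing
 * would want a compile-time error instead.
 */
static inline __m128i
vector8_shift_left_16_sketch(const __m128i v, int nbits)
{
	switch (nbits)
	{
		case 4:					/* the "nibble" case */
			return _mm_slli_epi16(v, 4);
		case 8:					/* the "byte" case */
			return _mm_slli_epi16(v, 8);
		default:
			return _mm_setzero_si128();
	}
}
```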

If you feel otherwise, I'd at least use actual numbers:
"shift_left_nibble" is an awkward way to say "shift left by 4 bits"
anyway, and also after "byte" and "nibble" there are not many good
English words to convey the operand amount. It's very possible that
needing other shift amounts will never come up, though.

> > +vector8_has_ge(const Vector8 v, const uint8 c)
> > +{
> > +#ifdef USE_SSE2
> > + Vector8 umax = _mm_max_epu8(v, vector8_broadcast(c));
> > + Vector8 cmpe = _mm_cmpeq_epi8(umax, v);
> > +
> > + return vector8_is_highbit_set(cmpe);
> >
> > We take pains to avoid signed comparison on unsigned input for the
> > "le" case, and I don't see why it's okay here.
>
> _mm_max_epu8() does unsigned comparisons, I think...

Ah, I confused myself about what the LE case was avoiding, namely
signed LE, not signed equality on something else.
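For anyone following along: the trick is that max(v, c) == v holds in an unsigned lane exactly when v >= c, so the equality comparison afterwards is sign-agnostic. A standalone sketch of the SSE2 arm:

```c
#include <emmintrin.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if any unsigned byte lane of v is >= c.  _mm_max_epu8 is
 * an unsigned max, so max(v, c) == v in a lane exactly when v >= c
 * there; _mm_cmpeq_epi8 then just tests bit equality.
 */
static inline bool
vector8_has_ge_sketch(const __m128i v, const uint8_t c)
{
	__m128i		umax = _mm_max_epu8(v, _mm_set1_epi8((char) c));
	__m128i		cmpe = _mm_cmpeq_epi8(umax, v);

	return _mm_movemask_epi8(cmpe) != 0;
}
```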

(Separately, now I'm wondering if we can do the same for
vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
would allow getting rid of )
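The analogous LE sketch, using an unsigned min the same way (untested beyond x86, and the real version would need a NEON arm with vminq_u8):

```c
#include <emmintrin.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if any unsigned byte lane of v is <= c: min(v, c) == v
 * in a lane exactly when v <= c there, with no signed comparison needed.
 */
static inline bool
vector8_has_le_sketch(const __m128i v, const uint8_t c)
{
	__m128i		umin = _mm_min_epu8(v, _mm_set1_epi8((char) c));
	__m128i		cmpe = _mm_cmpeq_epi8(umin, v);

	return _mm_movemask_epi8(cmpe) != 0;
}
```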

> > Do the regression tests have long enough cases that test exceptional
> > paths, like invalid bytes and embedded whitespace? If not, we need
> > some.
>
> Added.

Seems comprehensive enough at a glance.

Other comments:

+ * back together to form the final hex-encoded string.  It might be
+ * possible to squeeze out a little more gain by manually unrolling the
+ * loop, but for now we don't bother.

My position (and I think the community agrees) is that manual
unrolling is a rare desperation move that has to be justified, so we
don't need to mention its lack.

+ * Some compilers are picky about casts to the same underlying type, and others
+ * are picky about implicit conversions with vector types.  This function does
+ * the same thing as vector32_broadcast(), but it returns a Vector8 and is
+ * carefully crafted to avoid compiler indigestion.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_broadcast_u32(const uint32 c)
+{
+#ifdef USE_SSE2
+ return vector32_broadcast(c);
+#elif defined(USE_NEON)
+ return (Vector8) vector32_broadcast(c);
+#endif
+}

I'm ambivalent about this: The use case doesn't seem well motivated,
since I don't know why we'd actually need to both broadcast arbitrary
integers and also view the result as bytes. Setting arbitrary bytes is
what we're really doing, and that would be more likely to be useful in
the future (attached, only tested on x86, and I think part of the
strangeness is the endianness you mentioned above). On the other hand,
the Arm workaround results in awful generated code compared to what
you have here. Since the "set" should be hoisted out of the outer
loop, and we already rely on this pattern for vector8_highbit_mask
anyway, it might be tolerable, and we can reduce the pain with bitwise
NOT.


+/*
+ * Pack 16-bit elements in the given vectors into a single vector of 8-bit
+ * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
+ * this will produce different results on different architectures.
+ */

v10 asserted this requirement -- that still seems like a good thing?
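Something like the following would restore the check on the SSE2 side (a sketch with a made-up name; the NEON arm would need its own assertion):

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

/*
 * Pack the 16-bit lanes of v1 and v2 into 8-bit lanes.  The assertions
 * verify the documented requirement that the high byte of every 16-bit
 * lane is zero: cmpeq-with-zero sets a movemask bit for each zero byte,
 * and on little-endian x86 the high bytes sit at the odd byte positions
 * (mask 0xAAAA).
 */
static inline __m128i
vector8_pack_16_sketch(const __m128i v1, const __m128i v2)
{
	assert((_mm_movemask_epi8(_mm_cmpeq_epi8(v1, _mm_setzero_si128())) & 0xAAAA) == 0xAAAA);
	assert((_mm_movemask_epi8(_mm_cmpeq_epi8(v2, _mm_setzero_si128())) & 0xAAAA) == 0xAAAA);

	/* lane values are 0..255, so unsigned saturation cannot alter them */
	return _mm_packus_epi16(v1, v2);
}
```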


--
John Naylor
Amazon Web Services
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 7ba92c2c481..eb932d4bd8a 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -317,6 +317,9 @@ static inline bool
 hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
 {
        Vector8         sub;
+       // TODO: set one and use bitwise NOT for the other
+       Vector8         maskupper = vector8_set(0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0);
+       Vector8         masklower = vector8_set(0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff);
        Vector8         msk;
        bool            ret;
 
@@ -334,9 +337,9 @@ hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
        *dst = vector8_issub(src, sub);
        ret = !vector8_has_ge(*dst, 0x10);
 
-       msk = vector8_and(*dst, vector8_broadcast_u32(0xff00ff00));
+       msk = vector8_and(*dst, masklower);
        msk = vector8_shift_right_byte(msk);
-       *dst = vector8_and(*dst, vector8_broadcast_u32(0x00ff00ff));
+       *dst = vector8_and(*dst, maskupper);
        *dst = vector8_shift_left_nibble(*dst);
        *dst = vector8_or(*dst, msk);
        return ret;
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 531d8b8b6d1..56e810ce081 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -101,6 +101,33 @@ static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
 static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
 #endif
 
+/*
+ * Populate a vector element-wise with the arguments.
+ */
+#ifndef USE_NO_SIMD
+#if defined(USE_NEON)
+// from a patch by Thomas Munro
+static inline Vector8
+vector8_set(uint8 v0, uint8 v1, uint8 v2, uint8 v3,
+        uint8 v4, uint8 v5, uint8 v6, uint8 v7,
+        uint8 v8, uint8 v9, uint8 v10, uint8 v11,
+        uint8 v12, uint8 v13, uint8 v14, uint8 v15)
+{
+       uint8 pg_attribute_aligned(16) values[16] = {
+               v0, v1, v2, v3,
+               v4, v5, v6, v7,
+               v8, v9, v10, v11,
+               v12, v13, v14, v15
+       };
+       return vld1q_u8(values);
+}
+#elif defined(USE_SSE2)
+#ifndef vector8_set
+#define vector8_set(...)               _mm_setr_epi8(__VA_ARGS__)
+#endif
+#endif
+#endif                                                 /* ! USE_NO_SIMD */
+
 /*
  * Load a chunk of memory into the given vector.
  */
@@ -368,6 +395,7 @@ vector8_highbit_mask(const Vector8 v)
         * returns a uint64, making it inconvenient to combine mask values from
         * multiple vectors.
         */
+       // TODO: use vector8_set
        static const uint8 mask[16] = {
                1 << 0, 1 << 1, 1 << 2, 1 << 3,
                1 << 4, 1 << 5, 1 << 6, 1 << 7,
