On Thu, Sep 25, 2025 at 4:40 AM Nathan Bossart <nathandboss...@gmail.com> wrote:
>
> On Wed, Sep 24, 2025 at 10:59:38AM +0700, John Naylor wrote:
> > +		if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
> > +			break;
> >
> > But if you really want to do something here, sprinkling "(un)likely"'s
> > here seems like solving the wrong problem (even if they make any
> > difference), since the early return is optimizing for exceptional
> > conditions. In other places (cf. the UTF8 string verifier), we
> > accumulate errors, and only if we have them at the end do we restart
> > from the beginning with the slow error-checking path that can show the
> > user the offending input.
>
> I switched to an accumulator approach in v11.

Looks good to me.

+		if (unlikely(!success))
+			i = 0;

This is after the main loop exits, and the cold path is literally one
instruction, so the motivation is not apparent to me.
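For the archives, the accumulator shape I was advocating looks about
like this (only a sketch, not the v11 code; decode_chunk() is a
hypothetical stand-in for the per-vector helper):

#include "port/simd.h"

/* Hypothetical per-chunk worker: returns false on any invalid byte. */
static bool decode_chunk(const uint8 *src, uint8 *dst);

static size_t
decode_fast_path(const uint8 *src, size_t len, uint8 *dst)
{
	bool		success = true;
	size_t		i;

	/* Fast path: no data-dependent branches inside the loop. */
	for (i = 0; i + sizeof(Vector8) <= len; i += sizeof(Vector8))
		success &= decode_chunk(src + i, dst + i / 2);

	/*
	 * Only on failure do we restart from the beginning with the scalar
	 * path, which can report exactly which input byte was bad.
	 */
	if (!success)
		i = 0;

	return i;					/* caller decodes [i, len) with slow path */
}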
> > The nibble/byte things are rather specific. Wouldn't it be more
> > logical to expose the already-generic shift operations and let the
> > caller say by how much? Or does the compiler refuse because the
> > intrinsic doesn't get an immediate value? Some are like that, but I'm
> > not sure about these. If so, that's annoying and I wonder if there's a
> > workaround.
>
> Yeah, the compiler refuses unless the value is an integer literal. I
> thought of using a switch statement to cover all the values used in-tree,
> but I didn't like that, either.

Neither option is great, but I mildly lean towards keeping it internal
with "switch" or whatever: By putting the burden of specifying shift
amounts on separately named functions, we run a risk of combinatorial
explosion in function names. If you feel otherwise, I'd at least use
actual numbers: "shift_left_nibble" is an awkward way to say "shift left
by 4 bits" anyway, and after "byte" and "nibble" there are not many good
English words to convey the operand amount. It's very possible that
needing other shift amounts will never come up, though.
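For illustration, the "switch" flavor could look about like this
(untested, and operating on 16-bit lanes is my assumption -- adjust to
whatever the helpers actually work on):

/*
 * Shift each 16-bit lane left by n bits.  The NEON intrinsic insists
 * on a literal shift amount, hence the switch over the amounts used
 * in-tree; the compiler folds the branch away when n is a constant.
 */
static inline Vector8
vector8_shift_left_bits(const Vector8 v, const int n)
{
	switch (n)
	{
		case 4:
#ifdef USE_SSE2
			return _mm_slli_epi16(v, 4);
#elif defined(USE_NEON)
			return vreinterpretq_u8_u16(vshlq_n_u16(vreinterpretq_u16_u8(v), 4));
#endif
		case 8:
#ifdef USE_SSE2
			return _mm_slli_epi16(v, 8);
#elif defined(USE_NEON)
			return vreinterpretq_u8_u16(vshlq_n_u16(vreinterpretq_u16_u8(v), 8));
#endif
		default:
			Assert(false);
			return v;
	}
}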
Since the "set" should be hoisted out of the outer loop, and we already rely on this pattern for vector8_highbit_mask anyway, it might be tolerable, and we can reduce the pain with bitwise NOT. +/* + * Pack 16-bit elements in the given vectors into a single vector of 8-bit + * elements. NB: The upper 8-bits of each 16-bit element must be zeros, else + * this will produce different results on different architectures. + */ v10 asserted this requirement -- that still seems like a good thing? -- John Naylor Amazon Web Services
> > Do the regression tests have long enough cases that test exceptional
> > paths, like invalid bytes and embedded whitespace? If not, we need
> > some.
>
> Added.

Seems comprehensive enough at a glance.

Other comments:

+ * back together to form the final hex-encoded string. It might be
+ * possible to squeeze out a little more gain by manually unrolling the
+ * loop, but for now we don't bother.

My position (and I think the community agrees) is that manual unrolling
is a rare desperation move that has to be justified, so we don't need
to mention its lack.

+ * Some compilers are picky about casts to the same underlying type, and others
+ * are picky about implicit conversions with vector types. This function does
+ * the same thing as vector32_broadcast(), but it returns a Vector8 and is
+ * carefully crafted to avoid compiler indigestion.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_broadcast_u32(const uint32 c)
+{
+#ifdef USE_SSE2
+	return vector32_broadcast(c);
+#elif defined(USE_NEON)
+	return (Vector8) vector32_broadcast(c);
+#endif
+}

I'm ambivalent about this: The use case doesn't seem well motivated,
since I don't know why we'd actually need to both broadcast arbitrary
integers and also view the result as bytes. Setting arbitrary bytes is
what we're really doing, and would more likely be useful in the future
(attached, only tested on x86, and I think part of the strangeness is
the endianness you mentioned above). On the other hand, the Arm
workaround results in awful generated code compared to what you have
here. Since the "set" should be hoisted out of the outer loop, and we
already rely on this pattern for vector8_highbit_mask anyway, it might
be tolerable, and we can reduce the pain with bitwise NOT (sketched
below, after the patch).

+/*
+ * Pack 16-bit elements in the given vectors into a single vector of 8-bit
+ * elements. NB: The upper 8 bits of each 16-bit element must be zeros, else
+ * this will produce different results on different architectures.
+ */

v10 asserted this requirement -- that still seems like a good thing?

--
John Naylor
Amazon Web Services

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 7ba92c2c481..eb932d4bd8a 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -317,6 +317,9 @@ static inline bool
 hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
 {
 	Vector8		sub;
+	/* TODO: set one and use bitwise NOT for the other */
+	Vector8		maskupper = vector8_set(0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0);
+	Vector8		masklower = vector8_set(0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff);
 	Vector8		msk;
 	bool		ret;
 
@@ -334,9 +337,9 @@ hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
 	*dst = vector8_issub(src, sub);
 	ret = !vector8_has_ge(*dst, 0x10);
 
-	msk = vector8_and(*dst, vector8_broadcast_u32(0xff00ff00));
+	msk = vector8_and(*dst, masklower);
 	msk = vector8_shift_right_byte(msk);
-	*dst = vector8_and(*dst, vector8_broadcast_u32(0x00ff00ff));
+	*dst = vector8_and(*dst, maskupper);
 	*dst = vector8_shift_left_nibble(*dst);
 	*dst = vector8_or(*dst, msk);
 	return ret;
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 531d8b8b6d1..56e810ce081 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -101,6 +101,33 @@ static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
 static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
 #endif
 
+/*
+ * Populate a vector element-wise with the arguments.
+ */
+#ifndef USE_NO_SIMD
+#if defined(USE_NEON)
+/* from a patch by Thomas Munro */
+static inline Vector8
+vector8_set(uint8 v0, uint8 v1, uint8 v2, uint8 v3,
+			uint8 v4, uint8 v5, uint8 v6, uint8 v7,
+			uint8 v8, uint8 v9, uint8 v10, uint8 v11,
+			uint8 v12, uint8 v13, uint8 v14, uint8 v15)
+{
+	uint8		pg_attribute_aligned(16) values[16] = {
+		v0, v1, v2, v3,
+		v4, v5, v6, v7,
+		v8, v9, v10, v11,
+		v12, v13, v14, v15
+	};
+	return vld1q_u8(values);
+}
+#elif defined(USE_SSE2)
+#ifndef vector8_set
+#define vector8_set(...) _mm_setr_epi8(__VA_ARGS__)
+#endif
+#endif
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Load a chunk of memory into the given vector.
  */
@@ -368,6 +395,7 @@ vector8_highbit_mask(const Vector8 v)
 	 * returns a uint64, making it inconvenient to combine mask values from
 	 * multiple vectors.
 	 */
+	/* TODO: use vector8_set */
 	static const uint8 mask[16] = {
 		1 << 0, 1 << 1, 1 << 2, 1 << 3,
 		1 << 4, 1 << 5, 1 << 6, 1 << 7,
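For the first TODO in the patch above, what I have in mind is setting
one mask and deriving the other with a bitwise NOT, roughly like this
(untested):

	const Vector8 masklower = vector8_set(0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff,
										  0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff);
#ifdef USE_SSE2
	/* _mm_andnot_si128(a, b) computes (~a) & b */
	const Vector8 maskupper = _mm_andnot_si128(masklower,
											   _mm_set1_epi8((char) 0xff));
#elif defined(USE_NEON)
	const Vector8 maskupper = vmvnq_u8(masklower);
#endif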