On Wed, Jul 21, 2021 at 8:08 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
>
> On Thu, Jul 22, 2021 at 6:16 AM John Naylor
> One question is whether this "one size fits all" approach will be
> extensible to wider SIMD.

Sure, it'll just take a little more work and complexity. For one,
16-byte SIMD can operate on 32-byte chunks with a bit of repetition:

-    __m128i     input;
+    __m128i     input1;
+    __m128i     input2;

-#define SIMD_STRIDE_LENGTH (sizeof(__m128i))
+#define SIMD_STRIDE_LENGTH 32

     while (len >= SIMD_STRIDE_LENGTH)
     {
-        input = vload(s);
+        input1 = vload(s);
+        input2 = vload(s + sizeof(input1));

-        check_for_zeros(input, &error);
+        check_for_zeros(input1, &error);
+        check_for_zeros(input2, &error);

         /*
          * If the chunk is all ASCII, we can skip the full UTF-8 check, but we
@@ -460,17 +463,18 @@ pg_validate_utf8_sse42(const unsigned char *s, int len)
          * sequences at the end. We only update prev_incomplete if the chunk
          * contains non-ASCII, since the error is cumulative.
          */
-        if (is_highbit_set(input))
+        if (is_highbit_set(bitwise_or(input1, input2)))
         {
-            check_utf8_bytes(prev, input, &error);
-            prev_incomplete = is_incomplete(input);
+            check_utf8_bytes(prev, input1, &error);
+            check_utf8_bytes(input1, input2, &error);
+            prev_incomplete = is_incomplete(input2);
         }
         else
         {
             error = bitwise_or(error, prev_incomplete);
         }

-        prev = input;
+        prev = input2;
         s += SIMD_STRIDE_LENGTH;
         len -= SIMD_STRIDE_LENGTH;
     }

So with a few #ifdefs, we can accommodate two sizes if we like.

For another, the prevN() functions would need to change, at least on
x86 -- that would require replacing _mm_alignr_epi8() with
_mm256_alignr_epi8() plus _mm256_permute2x128_si256(). Also, we might
have to do something with the vector typedef.

That said, I think we can punt on that until we have an application
that's much more compute-intensive. As it is with SSE4, COPY FROM
WHERE <selective predicate> already pushes the utf8 validation way
down in profiles.
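To illustrate the prevN() point: _mm256_alignr_epi8() shifts within each
128-bit lane independently, so on its own it cannot carry bytes from the
previous vector's high lane into the current vector's low lane. Below is
a rough scalar model of the lane semantics, not actual AVX2 code and not
part of any patch. The names model_permute_0x21(), model_alignr() and
model_prev() are made up for this sketch, and the 0x21 selector is my
assumption about how the two intrinsics would be combined.

```c
#include <stdint.h>
#include <string.h>

#define LANE 16                 /* bytes per 128-bit lane */
#define VEC  32                 /* bytes per 256-bit vector */

/*
 * Scalar model of _mm256_permute2x128_si256(a, b, 0x21):
 * low lane of the result = high lane of a,
 * high lane of the result = low lane of b.
 */
static void
model_permute_0x21(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    memcpy(out, a + LANE, LANE);
    memcpy(out + LANE, b, LANE);
}

/*
 * Scalar model of _mm256_alignr_epi8(a, b, imm) for 0 <= imm <= 16.
 * Each 128-bit lane is handled on its own: concatenate a's lane (high)
 * with b's lane (low), shift right by imm bytes, keep the low 16 bytes.
 * "out" must not alias "a" or "b".
 */
static void
model_alignr(const uint8_t *a, const uint8_t *b, int imm, uint8_t *out)
{
    for (int lane = 0; lane < VEC; lane += LANE)
    {
        for (int j = 0; j < LANE; j++)
        {
            int         k = j + imm;

            out[lane + j] = (k < LANE) ? b[lane + k] : a[lane + k - LANE];
        }
    }
}

/*
 * Hypothetical 32-byte prevN(): the result is "input" shifted up by n
 * bytes (1 <= n <= 16), with the last n bytes of "prev" filling the gap.
 * The permute first builds [prev.high, input.low] so that each lane of
 * the alignr sees the 16 bytes that immediately precede it.
 */
static void
model_prev(const uint8_t *prev, const uint8_t *input, int n, uint8_t *out)
{
    uint8_t     shifted[VEC];

    model_permute_0x21(prev, input, shifted);
    model_alignr(input, shifted, LANE - n, out);
}
```

Calling model_alignr(input, prev, LANE - 1, out) directly, without the
permute, puts prev[15] rather than prev[31] into out[0], which is why the
extra instruction would be needed.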
> FWIW here are some performance results from my humble RPI4:
>
> master:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     4172 |  2763 |  1823
> (1 row)
>
> Your v15 patch:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     2267 |  1248 |   399
> (1 row)
>
> Your v15 patch set + the NEON patch, configured with USE_UTF8_SIMD=1:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>      909 |   620 |   318
> (1 row)
>
> It's so good I wonder if it's producing incorrect results :-)

Nice! If it passes regression tests, it *should* be fine, but stress
testing would be welcome on any platform.

> I also tried to do a quick and dirty AltiVec patch to see if it could
> fit into the same code "shape", with less immediate success: it works
> out slower than the fallback code on the POWER7 machine I scrounged an
> account on. I'm not sure what's wrong there, but maybe it's a useful
> start (I'm probably confused about endianness, or the encoding of
> boolean vectors which may be different (is true 0x01 or 0xff, does it
> matter?), or something else, and it's falling back on errors all the
> time?).

Hmm, I have access to a power8 machine to play with, but I also don't
mind having some type of server-class hardware that relies on the
recent nifty DFA fallback, which performs even better on powerpc64le
than v15.

--
John Naylor
EDB: http://www.enterprisedb.com