On 01/02/2021 19:32, John Naylor wrote:
It makes sense to start with the ascii subset of UTF-8 for a couple
reasons. First, ascii is very widespread in database content,
particularly in bulk loads. Second, ascii can be validated using the
simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and
I'm guessing we can detect that at compile time and not mess with
runtime checks. The examples above using SSE for the general case are
much more complicated and involve SSE 4.2 or AVX.
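The core of that check is only a few intrinsics. A minimal sketch (is_ascii_sse2 is a name made up here, not code from the patch; embedded-NUL detection, which the verify functions also need, is left out to keep it short):

#include <emmintrin.h>			/* SSE2 */
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if no byte in s[0..len) has its high bit set.
 */
static bool
is_ascii_sse2(const unsigned char *s, size_t len)
{
	const unsigned char *end = s + (len & ~(size_t) 15);

	for (; s < end; s += 16)
	{
		__m128i		chunk = _mm_loadu_si128((const __m128i *) s);

		/* _mm_movemask_epi8 gathers the high bit of each byte */
		if (_mm_movemask_epi8(chunk) != 0)
			return false;
	}

	/* handle the remaining 0-15 bytes one at a time */
	for (size_t i = 0; i < (len & 15); i++)
	{
		if (s[i] & 0x80)
			return false;
	}

	return true;
}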
I wonder how using SSE compares with dealing with 64- or 32-bit words at
a time, using regular instructions? That would be more portable.
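Something along these lines, presumably (a sketch; the memcpy avoids alignment assumptions, and compilers reduce it to a plain load):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Portable word-at-a-time variant: test the high bits of eight bytes
 * at once, stopping at the first word containing a non-ASCII byte.
 */
static bool
is_ascii_word(const unsigned char *s, size_t len)
{
	while (len >= sizeof(uint64_t))
	{
		uint64_t	word;

		memcpy(&word, s, sizeof(word));
		if (word & UINT64_C(0x8080808080808080))
			return false;
		s += sizeof(word);
		len -= sizeof(word);
	}

	while (len-- > 0)
	{
		if (*s++ & 0x80)
			return false;
	}

	return true;
}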
Here are some numbers on my laptop (macOS/clang 10 -- if the concept is
okay, I'll do Linux/gcc and add more inputs). The test is the same as
Heikki shared in [3], but I added a case with >95% Chinese characters
just to show how that compares to the mixed ascii/multibyte case.
master:
 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

patch:
 chinese | mixed | ascii
---------+-------+-------
    1103 |   498 |    51
The speedup in the pure ascii case is nice.
Yep.
In the attached POC, I just have a pro forma portability stub, and left
full portability detection for later. The fast path is inlined inside
pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted
into a separate function that takes a function pointer for the full
encoding validation. That would allow other encodings with strict ascii
subsets to use this as well, but coding that abstraction might be a
little messy, and b80e10638e3 already gives a performance boost over PG13.
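For instance (a rough sketch with made-up names; a real version would use the chunked checks above rather than a byte loop):

/* hypothetical type for an encoding-specific validator */
typedef int (*encoding_verifier) (const unsigned char *s, int len);

/*
 * Shared fast path: count leading ASCII bytes, then let the
 * encoding-specific validator handle the rest.  Returns the number
 * of valid bytes, in the style of the verifystr functions.
 */
static int
verifystr_ascii_fastpath(const unsigned char *s, int len,
						 encoding_verifier verifier)
{
	int			i = 0;

	/*
	 * NUL is not a valid character in server encodings, so stop there
	 * and let the verifier reject it.
	 */
	while (i < len && s[i] != 0 && s[i] < 0x80)
		i++;

	return i + verifier(s + i, len - i);
}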
All supported encodings are ASCII subsets. Might be best to put the
ASCII check into a static inline function and use it in all the verify
functions. I presume it's only a few instructions, and these functions
can be pretty performance sensitive.
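That is, something like this (names illustrative), with the call at the top of each verify function rather than behind a function pointer:

/* count leading bytes in [0x01, 0x7F]; cheap enough to inline everywhere */
static inline int
ascii_prefix_len(const unsigned char *s, int len)
{
	int			i = 0;

	while (i < len && s[i] != 0 && s[i] < 0x80)
		i++;
	return i;
}

and then each pg_xxx_verifystr() would open with:

	int			ok = ascii_prefix_len(s, len);

	if (ok == len)
		return len;				/* pure ASCII input */
	/* ... fall through to the encoding-specific loop at s + ok ... */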
I also gave a shot at doing full UTF-8 recognition using a DFA, but so
far that has made performance worse. If I ever have more success with
it, I'll add it to the mix.
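For concreteness, such a validator might look like the following -- a sketch using an explicit switch over states derived from Unicode Table 3-7; a table-driven version would replace the branches with a byte-class lookup plus a transition table, and none of these names come from the patch:

#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	START,						/* expecting a lead byte */
	CS1,						/* one continuation byte (0x80-0xBF) left */
	CS2,						/* two continuation bytes left */
	CS3,						/* three continuation bytes left */
	P3A,						/* after 0xE0: next must be 0xA0-0xBF */
	P3B,						/* after 0xED: next must be 0x80-0x9F */
	P4A,						/* after 0xF0: next must be 0x90-0xBF */
	P4B							/* after 0xF4: next must be 0x80-0x8F */
} utf8_state;

static bool
utf8_dfa_valid(const unsigned char *s, size_t len)
{
	utf8_state	state = START;

	for (size_t i = 0; i < len; i++)
	{
		unsigned char b = s[i];

		switch (state)
		{
			case START:
				if (b < 0x80)
					break;		/* ASCII: stay in START */
				else if (b >= 0xC2 && b <= 0xDF)
					state = CS1;
				else if (b == 0xE0)
					state = P3A;	/* excludes overlong 3-byte forms */
				else if (b == 0xED)
					state = P3B;	/* excludes UTF-16 surrogates */
				else if (b >= 0xE1 && b <= 0xEF)
					state = CS2;
				else if (b == 0xF0)
					state = P4A;	/* excludes overlong 4-byte forms */
				else if (b >= 0xF1 && b <= 0xF3)
					state = CS3;
				else if (b == 0xF4)
					state = P4B;	/* excludes codepoints > U+10FFFF */
				else
					return false;	/* 0x80-0xC1, 0xF5-0xFF */
				break;
			case CS1:
				if (b < 0x80 || b > 0xBF)
					return false;
				state = START;
				break;
			case CS2:
				if (b < 0x80 || b > 0xBF)
					return false;
				state = CS1;
				break;
			case CS3:
				if (b < 0x80 || b > 0xBF)
					return false;
				state = CS2;
				break;
			case P3A:
				if (b < 0xA0 || b > 0xBF)
					return false;
				state = CS1;
				break;
			case P3B:
				if (b < 0x80 || b > 0x9F)
					return false;
				state = CS1;
				break;
			case P4A:
				if (b < 0x90 || b > 0xBF)
					return false;
				state = CS2;
				break;
			case P4B:
				if (b < 0x80 || b > 0x8F)
					return false;
				state = CS2;
				break;
		}
	}
	return state == START;		/* reject truncated sequences */
}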
That's disappointing. Perhaps the SIMD algorithms have higher startup
costs, so that you need longer inputs to benefit? In that case, it might
make sense to check the length of the input and only use the SIMD
algorithm if the input is long enough.
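Something like this at the top, say (the cutoff is a number picked out of the air and would need benchmarking; both function names are hypothetical):

/* hypothetical variants; both return the number of valid bytes */
extern int	pg_utf8_verifystr_simd(const unsigned char *s, int len);
extern int	pg_utf8_verifystr_scalar(const unsigned char *s, int len);

static int
pg_utf8_verifystr_dispatch(const unsigned char *s, int len)
{
	if (len >= 32)				/* made-up threshold */
		return pg_utf8_verifystr_simd(s, len);

	return pg_utf8_verifystr_scalar(s, len);
}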
- Heikki