On 01/02/2021 19:32, John Naylor wrote:
It makes sense to start with the ascii subset of UTF-8 for a couple
reasons. First, ascii is very widespread in database content,
particularly in bulk loads. Second, ascii can be validated using the
simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and
I'm guessing we can detect that at compile time and not mess with
runtime checks. The examples above using SSE for the general case are
much more complicated and involve SSE 4.2 or AVX.
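The core of that check is only a few intrinsics. A minimal sketch (is_ascii_sse2 is a name made up here, not code from the patch; embedded-NUL detection, which the verify functions also need, is left out to keep it short):

#include <emmintrin.h>			/* SSE2 */
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if no byte in s[0..len) has its high bit set.
 */
static bool
is_ascii_sse2(const unsigned char *s, size_t len)
{
	const unsigned char *end = s + (len & ~(size_t) 15);

	for (; s < end; s += 16)
	{
		__m128i		chunk = _mm_loadu_si128((const __m128i *) s);

		/* _mm_movemask_epi8 gathers the high bit of each byte */
		if (_mm_movemask_epi8(chunk) != 0)
			return false;
	}

	/* handle the remaining 0-15 bytes one at a time */
	for (size_t i = 0; i < (len & 15); i++)
	{
		if (s[i] & 0x80)
			return false;
	}

	return true;
}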
I wonder how using SSE compares with dealing with 64- or 32-bit words at
a time, using regular instructions? That would be more portable.
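Something along these lines, presumably (a sketch; the memcpy avoids alignment assumptions, and compilers reduce it to a plain load):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Portable word-at-a-time variant: test the high bits of eight bytes
 * at once, stopping at the first word containing a non-ASCII byte.
 */
static bool
is_ascii_word(const unsigned char *s, size_t len)
{
	while (len >= sizeof(uint64_t))
	{
		uint64_t	word;

		memcpy(&word, s, sizeof(word));
		if (word & UINT64_C(0x8080808080808080))
			return false;
		s += sizeof(word);
		len -= sizeof(word);
	}

	while (len-- > 0)
	{
		if (*s++ & 0x80)
			return false;
	}

	return true;
}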
Here are some numbers on my laptop (macOS/clang 10 -- if the concept is
okay, I'll do Linux/gcc and add more inputs). The test is the same as
Heikki shared in [3], but I added a case with >95% Chinese characters
just to show how that compares to the mixed ascii/multibyte case.
master:
 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

patch:
 chinese | mixed | ascii
---------+-------+-------
    1103 |   498 |    51
The speedup in the pure ascii case is nice.
Yep.
In the attached POC, I just have a pro forma portability stub, and left
full portability detection for later. The fast path is inlined inside
pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted
into a separate function that takes a function pointer for the full
encoding validation. That would allow other encodings with strict ascii
subsets to use this as well, but coding that abstraction might be a
little messy, and b80e10638e3 already gives a performance boost over PG13.
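For instance (a rough sketch with made-up names; a real version would use the chunked checks above rather than a byte loop):

/* hypothetical type for an encoding-specific validator */
typedef int (*encoding_verifier) (const unsigned char *s, int len);

/*
 * Shared fast path: count leading ASCII bytes, then let the
 * encoding-specific validator handle the rest.  Returns the number
 * of valid bytes, in the style of the verifystr functions.
 */
static int
verifystr_ascii_fastpath(const unsigned char *s, int len,
						 encoding_verifier verifier)
{
	int			i = 0;

	/*
	 * NUL is not a valid character in server encodings, so stop there
	 * and let the verifier reject it.
	 */
	while (i < len && s[i] != 0 && s[i] < 0x80)
		i++;

	return i + verifier(s + i, len - i);
}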
All supported encodings are ASCII subsets. Might be best to put the
ASCII check into a static inline function and use it in all the verify
functions. I presume it's only a few instructions, and these functions
can be pretty performance sensitive.
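That is, something like this (names illustrative), with the call at the top of each verify function rather than behind a function pointer:

/* count leading bytes in [0x01, 0x7F]; cheap enough to inline everywhere */
static inline int
ascii_prefix_len(const unsigned char *s, int len)
{
	int			i = 0;

	while (i < len && s[i] != 0 && s[i] < 0x80)
		i++;
	return i;
}

and then each pg_xxx_verifystr() would open with:

	int			ok = ascii_prefix_len(s, len);

	if (ok == len)
		return len;				/* pure ASCII input */
	/* ... fall through to the encoding-specific loop at s + ok ... */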
I also gave a shot at doing full UTF-8 recognition using a DFA, but so
far that has made performance worse. If I ever have more success with
it, I'll add it to the mix.
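For concreteness, such a validator might look like the following -- a sketch using an explicit switch over states derived from Unicode Table 3-7; a table-driven version would replace the branches with a byte-class lookup plus a transition table, and none of these names come from the patch:

#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	START,						/* expecting a lead byte */
	CS1,						/* one continuation byte (0x80-0xBF) left */
	CS2,						/* two continuation bytes left */
	CS3,						/* three continuation bytes left */
	P3A,						/* after 0xE0: next must be 0xA0-0xBF */
	P3B,						/* after 0xED: next must be 0x80-0x9F */
	P4A,						/* after 0xF0: next must be 0x90-0xBF */
	P4B							/* after 0xF4: next must be 0x80-0x8F */
} utf8_state;

static bool
utf8_dfa_valid(const unsigned char *s, size_t len)
{
	utf8_state	state = START;

	for (size_t i = 0; i < len; i++)
	{
		unsigned char b = s[i];

		switch (state)
		{
			case START:
				if (b < 0x80)
					break;		/* ASCII: stay in START */
				else if (b >= 0xC2 && b <= 0xDF)
					state = CS1;
				else if (b == 0xE0)
					state = P3A;	/* excludes overlong 3-byte forms */
				else if (b == 0xED)
					state = P3B;	/* excludes UTF-16 surrogates */
				else if (b >= 0xE1 && b <= 0xEF)
					state = CS2;
				else if (b == 0xF0)
					state = P4A;	/* excludes overlong 4-byte forms */
				else if (b >= 0xF1 && b <= 0xF3)
					state = CS3;
				else if (b == 0xF4)
					state = P4B;	/* excludes codepoints > U+10FFFF */
				else
					return false;	/* 0x80-0xC1, 0xF5-0xFF */
				break;
			case CS1:
				if (b < 0x80 || b > 0xBF)
					return false;
				state = START;
				break;
			case CS2:
				if (b < 0x80 || b > 0xBF)
					return false;
				state = CS1;
				break;
			case CS3:
				if (b < 0x80 || b > 0xBF)
					return false;
				state = CS2;
				break;
			case P3A:
				if (b < 0xA0 || b > 0xBF)
					return false;
				state = CS1;
				break;
			case P3B:
				if (b < 0x80 || b > 0x9F)
					return false;
				state = CS1;
				break;
			case P4A:
				if (b < 0x90 || b > 0xBF)
					return false;
				state = CS2;
				break;
			case P4B:
				if (b < 0x80 || b > 0x8F)
					return false;
				state = CS2;
				break;
		}
	}
	return state == START;		/* reject truncated sequences */
}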
That's disappointing. Perhaps the SIMD algorithms have higher startup
costs, so that you need longer inputs to benefit? In that case, it might
make sense to check the length of the input and only use the SIMD
algorithm if the input is long enough.
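Something like this at the top, say (the cutoff is a number picked out of the air and would need benchmarking; both function names are hypothetical):

/* hypothetical variants; both return the number of valid bytes */
extern int	pg_utf8_verifystr_simd(const unsigned char *s, int len);
extern int	pg_utf8_verifystr_scalar(const unsigned char *s, int len);

static int
pg_utf8_verifystr_dispatch(const unsigned char *s, int len)
{
	if (len >= 32)				/* made-up threshold */
		return pg_utf8_verifystr_simd(s, len);

	return pg_utf8_verifystr_scalar(s, len);
}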
- Heikki