Re: [POC] verifying UTF-8 using SIMD instructions

Heikki Linnakangas Mon, 15 Feb 2021 05:18:29 -0800

On 13/02/2021 03:31, John Naylor wrote:

On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
 >
 > I also tested the fallback implementation from the simdjson library
 > (included in the patch, if you uncomment it in simdjson-glue.c):
 >
 >   mixed | ascii
 > -------+-------
 >     447 |    46
 > (1 row)
 >
 > I think we should at least try to adopt that. At a high level, it looks
 > pretty similar to your patch: you load the data 8 bytes at a time, check if
 > they are all ASCII. If there are any non-ASCII chars, you check the
 > bytes one by one, otherwise you load the next 8 bytes. Your patch should
 > be able to achieve the same performance, if done right. I don't think
 > the simdjson code forbids \0 bytes, so that will add a few cycles, but
 > still.
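For readers following along, the 8-byte check discussed in the quoted text can be sketched roughly as below. This is a minimal illustration, not code from the patch or from simdjson: the helper name chunk_is_ascii8 is hypothetical, and the zero-byte test is one common bit trick chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): returns true if the 8 bytes
 * at s are all ASCII (high bit clear) and none of them is a NUL byte.
 */
static bool
chunk_is_ascii8(const unsigned char *s)
{
    uint64_t    chunk;

    assert(s != NULL);

    /* memcpy sidesteps the undefined behavior of a misaligned pointer cast */
    memcpy(&chunk, s, sizeof(chunk));

    /* any byte >= 0x80 has its high bit set */
    if (chunk & UINT64_C(0x8080808080808080))
        return false;

    /*
     * Every byte is now known to be <= 0x7F. Adding 0x7F to each byte sets
     * the high bit of every nonzero byte without carrying into the next
     * byte, so a byte whose high bit is still clear must have been zero.
     */
    chunk += UINT64_C(0x7f7f7f7f7f7f7f7f);
    return (chunk & UINT64_C(0x8080808080808080)) ==
        UINT64_C(0x8080808080808080);
}
```

The extra addition and comparison are the "few cycles" that rejecting \0 bytes costs on top of the plain high-bit test.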
Attached is a patch that does roughly what the simdjson fallback did, except I use straight tests on the bytes and only calculate code points in assertion builds. In the course of doing this, I found that my earlier concerns about putting the ascii check in a static inline function were due to my suboptimal loop implementation. I had assumed that if the chunked ascii check failed, it had to check all those bytes one at a time. As it turns out, that's a waste of the branch predictor. In the v2 patch, we do the chunked ascii check every time we loop. With that, I can also confirm the claim in the Lemire paper that it's better to do the check on 16-byte chunks:
(MacOS, Clang 10)

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

v2 patch, with 16-byte stride:

 chinese | mixed | ascii
---------+-------+-------
     806 |   474 |    83

patch but with 8-byte stride:

 chinese | mixed | ascii
---------+-------+-------
     792 |   490 |   105
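
The loop shape described above (try the chunked ascii check on every iteration, and on failure validate just one character before trying again) might look roughly like the following. This is a sketch under assumptions, not the v2 patch itself: the names is_ascii16, utf8_char_len and verify_utf8 are made up for illustration, and the single-character validator is a simplified stand-in written for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIDE 16
#define HIGH_BITS UINT64_C(0x8080808080808080)
#define LOW_7F    UINT64_C(0x7f7f7f7f7f7f7f7f)
#define IS_CONT(b) (((b) & 0xC0) == 0x80)

/* True if the 16 bytes at s are all ASCII and contain no NUL byte. */
static bool
is_ascii16(const unsigned char *s)
{
    uint64_t    a, b;

    memcpy(&a, s, 8);
    memcpy(&b, s + 8, 8);
    if ((a | b) & HIGH_BITS)
        return false;           /* some byte is >= 0x80 */
    a += LOW_7F;                /* sets the high bit of each nonzero byte */
    b += LOW_7F;
    return ((a & b) & HIGH_BITS) == HIGH_BITS;
}

/* Length of the valid character at s, or 0 if invalid or truncated. */
static int
utf8_char_len(const unsigned char *s, size_t len)
{
    unsigned char b0 = s[0];

    if (b0 < 0x80)
        return b0 != 0 ? 1 : 0;
    if (b0 >= 0xC2 && b0 <= 0xDF)
        return (len >= 2 && IS_CONT(s[1])) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF)
    {
        if (len < 3 || !IS_CONT(s[1]) || !IS_CONT(s[2]))
            return 0;
        if (b0 == 0xE0 && s[1] < 0xA0)  /* overlong encoding */
            return 0;
        if (b0 == 0xED && s[1] > 0x9F)  /* UTF-16 surrogate range */
            return 0;
        return 3;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4)
    {
        if (len < 4 || !IS_CONT(s[1]) || !IS_CONT(s[2]) || !IS_CONT(s[3]))
            return 0;
        if (b0 == 0xF0 && s[1] < 0x90)  /* overlong encoding */
            return 0;
        if (b0 == 0xF4 && s[1] > 0x8F)  /* beyond U+10FFFF */
            return 0;
        return 4;
    }
    return 0;                   /* 0x80..0xC1 and 0xF5..0xFF never start a char */
}

/* Returns the number of leading bytes of s that form valid UTF-8. */
static size_t
verify_utf8(const unsigned char *s, size_t len)
{
    const unsigned char *start = s;

    assert(s != NULL);
    while (len > 0)
    {
        int         l;

        /* Fast path: attempted on every iteration, not only after a success. */
        if (len >= STRIDE && is_ascii16(s))
        {
            s += STRIDE;
            len -= STRIDE;
            continue;
        }

        /* Slow path: consume exactly one character, then retry the fast path. */
        l = utf8_char_len(s, len);
        if (l == 0)
            break;
        s += l;
        len -= l;
    }
    return (size_t) (s - start);
}
```

Changing STRIDE to 8 (with a matching 8-byte check) gives the shape of the "8-byte stride" variant measured above; the sketch makes no claim about reproducing those timings.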
I also included the fast path in all other multibyte encodings, and that is also pretty good performance-wise.


Cool.

It regresses from master on pure multibyte input, but that case is still faster than PG13, which I simulated by reverting 6c5576075b0f9 and b80e10638e3:

I thought the "chinese" numbers above are pure multibyte input, and it seems to do well on that. Where does it regress? In multibyte encodings other than UTF-8? How bad is the regression?

I tested this on my first generation Raspberry Pi (chipmunk). I had to tweak it a bit to make it compile, since the SSE autodetection code was not finished yet. And I used generate_series(1, 1000) instead of generate_series(1, 10000) in the test script (mbverifystr-speed.sql) because this system is so slow.


master:

 mixed | ascii
-------+-------
  1310 |  1041
(1 row)

v2-add-portability-stub-and-new-fallback.patch:

 mixed | ascii
-------+-------
  2979 |   910
(1 row)

I'm guessing that's because the unaligned access in check_ascii() is expensive on this platform.
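
If unaligned loads are indeed the cause, one possible way to avoid them is to step byte by byte until the pointer is 8-byte aligned and only then switch to word loads. The sketch below illustrates that idea only; it is not part of the patch, the name ascii_prefix_len_aligned is hypothetical, and whether it would help on this hardware is untested. The ASSUME_ALIGNED8 macro relies on the GCC/Clang builtin __builtin_assume_aligned and degrades to a plain pointer elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Alignment hint for the compiler; a no-op on compilers without the builtin. */
#if defined(__GNUC__) || defined(__clang__)
#define ASSUME_ALIGNED8(p) __builtin_assume_aligned((p), 8)
#else
#define ASSUME_ALIGNED8(p) (p)
#endif

/*
 * Hypothetical helper: returns the length of the leading run of ASCII
 * bytes in s[0..len). Word-sized loads happen only at 8-byte-aligned
 * addresses.
 */
static size_t
ascii_prefix_len_aligned(const unsigned char *s, size_t len)
{
    size_t      i = 0;

    assert(s != NULL);

    /* Byte at a time until s + i is 8-byte aligned. */
    while (i < len && ((uintptr_t) (s + i) % sizeof(uint64_t)) != 0)
    {
        if (s[i] & 0x80)
            return i;
        i++;
    }

    /* Aligned 8-byte loads while a full word remains. */
    while (len - i >= sizeof(uint64_t))
    {
        uint64_t    chunk;

        memcpy(&chunk, ASSUME_ALIGNED8(s + i), sizeof(chunk));
        if (chunk & UINT64_C(0x8080808080808080))
            break;              /* a non-ASCII byte is somewhere in this word */
        i += sizeof(chunk);
    }

    /* Finish byte-wise: locates the exact byte, or handles the tail. */
    while (i < len && !(s[i] & 0x80))
        i++;
    return i;
}
```

The cost of this approach is an extra byte-wise prologue on every call, so it would need to be measured against the memcpy-based version on both aligned-friendly and strict-alignment platforms.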


- Heikki
