On 13/02/2021 03:31, John Naylor wrote:
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>  mixed | ascii
> -------+-------
>    447 |    46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar to your patch: you load the data 8 bytes at a time, check if
> they are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one, otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.
Attached is a patch that does roughly what the simdjson fallback did, except
I use straight tests on the bytes and only calculate code points in
assertion builds. In the course of doing this, I found that my earlier
concerns about putting the ascii check in a static inline function were
due to my suboptimal loop implementation. I had assumed that if the
chunked ascii check failed, it had to check all those bytes one at a
time. As it turns out, that's a waste of the branch predictor. In the v2
patch, we do the chunked ascii check every time we loop. With that, I
can also confirm the claim in the Lemire paper that it's better to do
the check on 16-byte chunks:
(MacOS, Clang 10)
master:
 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366
v2 patch, with 16-byte stride:
 chinese | mixed | ascii
---------+-------+-------
     806 |   474 |    83
v2 patch, but with 8-byte stride:
 chinese | mixed | ascii
---------+-------+-------
     792 |   490 |   105
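
For readers following along, the loop structure described above can be sketched roughly as follows. This is a minimal illustration, not the code from the v2 patch: the names STRIDE, chunk_is_ascii(), verify_one_char() and verify_str() are hypothetical, and the per-character check is deliberately simplified (it looks only at the lead byte and the continuation bytes, and does not reject overlong or surrogate encodings). The point it tries to show is that the chunked ASCII check is retried on every iteration, so a failed chunk costs only one character on the slow path before the fast path gets another chance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIDE    16	/* hypothetical name; bytes examined per fast-path step */
#define HIGH_BITS UINT64_C(0x8080808080808080)
#define LOW7_BITS UINT64_C(0x7f7f7f7f7f7f7f7f)

/* Return 1 if the STRIDE bytes at s are all nonzero ASCII, else 0. */
static inline int
chunk_is_ascii(const unsigned char *s)
{
	uint64_t	half1;
	uint64_t	half2;

	/* memcpy keeps the loads well-defined for any alignment of s */
	memcpy(&half1, s, sizeof(half1));
	memcpy(&half2, s + sizeof(half1), sizeof(half2));

	/* any high bit set means a non-ASCII byte is present */
	if ((half1 | half2) & HIGH_BITS)
		return 0;

	/*
	 * Every byte is now known to be < 0x80, so adding 0x7f to each byte
	 * cannot carry into its neighbour. The high bit of a byte ends up set
	 * exactly when that byte was nonzero.
	 */
	if ((((half1 + LOW7_BITS) & (half2 + LOW7_BITS)) & HIGH_BITS) != HIGH_BITS)
		return 0;

	return 1;
}

/*
 * Simplified slow path (an assumption for illustration): return the byte
 * length of the character at s, or -1 if it is a zero byte, has an invalid
 * lead byte, is truncated, or has a bad continuation byte.
 */
static int
verify_one_char(const unsigned char *s, size_t len)
{
	int			l;
	int			i;

	if (s[0] == 0)
		return -1;
	if (s[0] < 0x80)
		return 1;
	else if ((s[0] & 0xe0) == 0xc0)
		l = 2;
	else if ((s[0] & 0xf0) == 0xe0)
		l = 3;
	else if ((s[0] & 0xf8) == 0xf0)
		l = 4;
	else
		return -1;

	if ((size_t) l > len)
		return -1;
	for (i = 1; i < l; i++)
		if ((s[i] & 0xc0) != 0x80)
			return -1;
	return l;
}

/* Return the number of leading bytes of s that passed verification. */
static size_t
verify_str(const unsigned char *s, size_t len)
{
	const unsigned char *start = s;

	while (len > 0)
	{
		int			l;

		/* fast path: tried on every iteration, not just at the start */
		if (len >= STRIDE && chunk_is_ascii(s))
		{
			s += STRIDE;
			len -= STRIDE;
			continue;
		}

		/* slow path: verify a single character, then loop back */
		l = verify_one_char(s, len);
		if (l < 0)
			break;
		s += l;
		len -= (size_t) l;
	}
	return (size_t) (s - start);
}
```

An 8-byte variant of this sketch would load a single uint64_t in chunk_is_ascii() and set STRIDE to 8; the loop itself would stay the same.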
I also included the fast path in all other multibyte encodings, and that
is also pretty good performance-wise.
Cool.
It regresses from master on pure
multibyte input, but that case is still faster than PG13, which I
simulated by reverting 6c5576075b0f9 and b80e10638e3:
I thought the "chinese" numbers above are pure multibyte input, and it
seems to do well on that. Where does it regress? In multibyte encodings
other than UTF-8? How bad is the regression?
I tested this on my first generation Raspberry Pi (chipmunk). I had to
tweak it a bit to make it compile, since the SSE autodetection code was
not finished yet. And I used generate_series(1, 1000) instead of
generate_series(1, 10000) in the test script (mbverifystr-speed.sql)
because this system is so slow.
master:
 mixed | ascii
-------+-------
  1310 |  1041
(1 row)
v2-add-portability-stub-and-new-fallback.patch:
 mixed | ascii
-------+-------
  2979 |   910
(1 row)
I'm guessing that's because the unaligned access in check_ascii() is
expensive on this platform.
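
To make that guess concrete, here is a rough sketch of one way a chunked ASCII scan could avoid starting its 8-byte loads at unaligned addresses: check single bytes until the pointer is 8-byte aligned, then proceed in 8-byte steps. This is only an assumption for illustration and is not taken from the patch; ascii_prefix_len() is a hypothetical name. It also only checks the high bit, not zero bytes. Note that the compiler cannot in general prove the alignment from this code alone, so whether the memcpy turns into a single aligned load depends on the compiler and target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HIGH_BITS UINT64_C(0x8080808080808080)

/*
 * Hypothetical helper: return the number of leading bytes of s (up to len)
 * that have the high bit clear.
 */
static size_t
ascii_prefix_len(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	/* prologue: single bytes until s + i is 8-byte aligned */
	while (i < len && ((uintptr_t) (s + i) % sizeof(uint64_t)) != 0)
	{
		if (s[i] & 0x80)
			return i;
		i++;
	}

	/* main loop: every 8-byte load now starts at an aligned address */
	while (len - i >= sizeof(uint64_t))
	{
		uint64_t	chunk;

		memcpy(&chunk, s + i, sizeof(chunk));
		if (chunk & HIGH_BITS)
			break;				/* locate the exact byte below */
		i += sizeof(chunk);
	}

	/* epilogue: the tail, or the chunk that contained a non-ASCII byte */
	while (i < len && !(s[i] & 0x80))
		i++;

	return i;
}
```

In a verifier whose slow path advances by one multibyte character at a time, the pointer will often be unaligned when the fast path is retried, so a prologue like this would run frequently; whether that is cheaper than the unaligned loads would have to be measured.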
- Heikki