Re: [POC] verifying UTF-8 using SIMD instructions

2021-07-22 Thread John Naylor
On Wed, Jul 21, 2021 at 8:08 PM Thomas Munro wrote: > > On Thu, Jul 22, 2021 at 6:16 AM John Naylor > One question is whether this "one size fits all" approach will be > extensible to wider SIMD. Sure, it'll just take a little more work and complexity. For one, 16-byte SIMD can operate on 32-byt

Re: [POC] verifying UTF-8 using SIMD instructions

2021-07-21 Thread Thomas Munro
On Thu, Jul 22, 2021 at 6:16 AM John Naylor wrote: > Neat! It's good to make it more architecture-agnostic, and I'm sure we can > use quite a bit of this. One question is whether this "one size fits all" approach will be extensible to wider SIMD. > to_bool(const pg_u8x16_t v) > { > +#if defin

Re: [POC] verifying UTF-8 using SIMD instructions

2021-07-21 Thread John Naylor
On Wed, Jul 21, 2021 at 11:29 AM Thomas Munro wrote: > Just for fun/experimentation, here's a quick (and probably too naive) > translation of those helper functions to NEON, on top of the v15 > patch. Neat! It's good to make it more architecture-agnostic, and I'm sure we can use quite a bit of t

Re: [POC] verifying UTF-8 using SIMD instructions

2021-07-21 Thread Thomas Munro
On Sat, Mar 13, 2021 at 4:37 AM John Naylor wrote: > On Fri, Mar 12, 2021 at 9:14 AM Amit Khandekar wrote: > > I was not thinking about auto-vectorizing the code in > > pg_validate_utf8_sse42(). Rather, I was considering auto-vectorization > > inside the individual helper functions that you wrote

Re: [POC] verifying UTF-8 using SIMD instructions

2021-04-01 Thread John Naylor
v9 is just a rebase. -- John Naylor EDB: http://www.enterprisedb.com From e876049ad3b153e8725ab23f65ae8f021a970470 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Thu, 1 Apr 2021 08:24:05 -0400 Subject: [PATCH v9] Replace pg_utf8_verifystr() with two faster implementations: On x86-64, we use S

Re: [POC] verifying UTF-8 using SIMD instructions

2021-03-12 Thread John Naylor
On Fri, Mar 12, 2021 at 9:14 AM Amit Khandekar wrote:
>
> On my Arm64 VM :
>
> HEAD :
>  mixed | ascii
> -------+-------
>   1091 |   628
> (1 row)
>
> PATCHED :
>  mixed | ascii
> -------+-------
>    681 |   119

Thanks for testing! Good, the speedup is about as much as I can hope for using plai

Re: [POC] verifying UTF-8 using SIMD instructions

2021-03-12 Thread Amit Khandekar
On Tue, 9 Mar 2021 at 17:14, John Naylor wrote: > On Tue, Mar 9, 2021 at 5:00 AM Amit Khandekar wrote: > > Just a quick question before I move on to review the patch ... The > > improvement looks like it is only meant for x86 platforms. > > Actually it's meant to be faster for all platforms, sinc

Re: [POC] verifying UTF-8 using SIMD instructions

2021-03-09 Thread John Naylor
On Tue, Mar 9, 2021 at 5:00 AM Amit Khandekar wrote: > > Hi, > > Just a quick question before I move on to review the patch ... The > improvement looks like it is only meant for x86 platforms. Actually it's meant to be faster for all platforms, since the C fallback is quite a bit different from H

Re: [POC] verifying UTF-8 using SIMD instructions

2021-03-09 Thread Amit Khandekar
Hi, Just a quick question before I move on to review the patch ... The improvement looks like it is only meant for x86 platforms. Can this be done in a portable way by arranging for auto-vectorization? Something like commit 88709176236caf. This way it would benefit other platforms as well. I tri

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-18 Thread John Naylor
On Mon, Feb 15, 2021 at 9:32 PM John Naylor wrote: > > On Mon, Feb 15, 2021 at 9:18 AM Heikki Linnakangas wrote: > > > > I'm guessing that's because the unaligned access in check_ascii() is > > expensive on this platform. > Some possible remedies: > 3) #ifdef out the ascii check for 32-bit plat

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-16 Thread John Naylor
I wrote: > [v3] > - It's not smart enough to stop at the last valid character boundary -- it's either all-valid or it must start over with the fallback. That will have to change in order to work with the proposed noError conversions. It shouldn't be very hard, but needs thought as to the clearest
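The "stop at the last valid character boundary" behavior described above can be illustrated with a plain byte-at-a-time validator. This is a minimal sketch, not code from the patch: the function name `utf8_valid_prefix_len` is hypothetical, the byte ranges follow the well-formed UTF-8 table in the Unicode Standard, and NUL bytes are not treated specially here (an assumption; the real verifier may handle them differently).

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the length in bytes of the longest prefix of
 * s[0..len) that consists only of complete, well-formed UTF-8 sequences.
 */
static size_t
utf8_valid_prefix_len(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	while (i < len)
	{
		unsigned char c = s[i];
		unsigned char lo = 0x80;	/* allowed range of the second byte */
		unsigned char hi = 0xBF;
		size_t		n;				/* total length of this sequence */
		size_t		k;

		if (c < 0x80)
		{
			i++;					/* ASCII */
			continue;
		}
		else if (c >= 0xC2 && c <= 0xDF)
			n = 2;
		else if (c >= 0xE0 && c <= 0xEF)
		{
			n = 3;
			if (c == 0xE0)
				lo = 0xA0;			/* reject overlong forms */
			if (c == 0xED)
				hi = 0x9F;			/* reject UTF-16 surrogates */
		}
		else if (c >= 0xF0 && c <= 0xF4)
		{
			n = 4;
			if (c == 0xF0)
				lo = 0x90;			/* reject overlong forms */
			if (c == 0xF4)
				hi = 0x8F;			/* reject code points above U+10FFFF */
		}
		else
			break;					/* stray continuation or invalid lead byte */

		if (len - i < n)
			break;					/* sequence truncated by end of input */
		if (s[i + 1] < lo || s[i + 1] > hi)
			break;
		for (k = 2; k < n; k++)
			if ((s[i + k] & 0xC0) != 0x80)
				break;
		if (k < n)
			break;					/* bad continuation byte */

		i += n;						/* advance past one whole character */
	}
	return i;
}
```

A fast all-or-nothing path could fall back to something like this only when it detects an error, so that the caller still learns how many leading bytes were valid.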

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-15 Thread John Naylor
On Mon, Feb 15, 2021 at 9:18 AM Heikki Linnakangas wrote: > Attached is the first attempt at using SSE4 to do the validation, but first I'll answer your questions about the fallback. I should mention that v2 had a correctness bug for 4-byte characters that I found when I was writing regression t

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-15 Thread Heikki Linnakangas
On 13/02/2021 03:31, John Naylor wrote: On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas wrote: > > I also tested the fallback implementation from the simdjson library > (included in the patch, if you uncomment it in simdjson-glue.c): > > mixed | ascii > ---

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-12 Thread John Naylor
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>  mixed | ascii
> -------+-------
>    447 |    46
> (1 row)
>
> I think we should at least try t

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-09 Thread John Naylor
On Tue, Feb 9, 2021 at 4:22 PM Heikki Linnakangas wrote: > > On 09/02/2021 22:08, John Naylor wrote: > > Maybe there's a smarter way to check for zeros in C. Or maybe be more > > careful about cache -- running memchr() on the whole input first might > > not be the best thing to do. > > The usual t

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-09 Thread John Naylor
I wrote: > > On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas wrote: > One of his earlier demos [1] (in simdutf8check.h) had a version that used mostly SSE2 with just three intrinsics from SSSE3. That's widely available by now. He measured that at 0.7 cycles per byte, which is still good compared

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-09 Thread Heikki Linnakangas
On 09/02/2021 22:08, John Naylor wrote: Maybe there's a smarter way to check for zeros in C. Or maybe be more careful about cache -- running memchr() on the whole input first might not be the best thing to do. The usual trick is the haszero() macro here: https://graphics.stanford.edu/~seander
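For readers unfamiliar with the haszero() trick mentioned above, here is a minimal sketch widened to 64-bit words. The names `HASZERO64` and `contains_nul` are hypothetical and not taken from the patch; the expression itself is the well-known one from the Bit Twiddling Hacks page, and `memcpy` is used for the loads so that alignment does not matter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if and only if at least one byte of the 64-bit word v is zero. */
#define HASZERO64(v) \
	(((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

/* Hypothetical helper: true if buf[0..len) contains a NUL byte. */
static bool
contains_nul(const unsigned char *buf, size_t len)
{
	size_t		i = 0;

	/* Check eight bytes at a time. */
	for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t))
	{
		uint64_t	chunk;

		memcpy(&chunk, buf + i, sizeof(chunk));
		if (HASZERO64(chunk))
			return true;
	}

	/* Check the remaining tail one byte at a time. */
	for (; i < len; i++)
		if (buf[i] == 0)
			return true;

	return false;
}
```

Checking per chunk, rather than running memchr() over the whole input first, touches each cache line only once when combined with other per-chunk checks.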

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-09 Thread John Naylor
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>  mixed | ascii
> -------+-------
>    447 |    46
> (1 row)
>
> I think we should at least try t

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-08 Thread John Naylor
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas wrote: > As a quick test, I hacked up pg_utf8_verifystr() to use Lemire's > algorithm from the simdjson library [1], see attached patch. I > microbenchmarked it using the same test I used before [2]. I've been looking at various iterations of

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-08 Thread Heikki Linnakangas
On 07/02/2021 22:24, John Naylor wrote: Here is a more polished version of the function pointer approach, now adapted to all multibyte encodings. Using the not-yet-committed tests from [1], I found a thinko bug that resulted in the test for nul bytes to not only be wrong, but probably also elid

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-07 Thread John Naylor
Here is a more polished version of the function pointer approach, now adapted to all multibyte encodings. Using the not-yet-committed tests from [1], I found a thinko bug that resulted in the test for nul bytes to not only be wrong, but probably also elided by the compiler. Doing it correctly is no

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-04 Thread John Naylor
On Mon, Feb 1, 2021 at 2:01 PM Heikki Linnakangas wrote: > > On 01/02/2021 19:32, John Naylor wrote: > > It makes sense to start with the ascii subset of UTF-8 for a couple > > reasons. First, ascii is very widespread in database content, > > particularly in bulk loads. Second, ascii can be valida

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-01 Thread Heikki Linnakangas
On 01/02/2021 19:32, John Naylor wrote: It makes sense to start with the ascii subset of UTF-8 for a couple reasons. First, ascii is very widespread in database content, particularly in bulk loads. Second, ascii can be validated using the simple SSE2 intrinsics that come with (I believe) any x6
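The reason ASCII is cheap to validate is that a byte is ASCII exactly when its high bit is clear, so many bytes can be tested with one mask. The sketch below shows the idea in portable C with 8-byte words rather than the SSE2 intrinsics the message refers to; that substitution is deliberate, and the name `is_all_ascii` is hypothetical. It tests only the high bit and does not look for NUL bytes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if every byte in s[0..len) is below 0x80. */
static bool
is_all_ascii(const unsigned char *s, size_t len)
{
	uint64_t	acc = 0;
	size_t		i = 0;

	/* OR whole 8-byte chunks together; any high bit set survives in acc. */
	for (; i + 8 <= len; i += 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, s + i, sizeof(chunk));
		acc |= chunk;
	}

	/* Fold in the remaining tail bytes (they land in the low byte). */
	for (; i < len; i++)
		acc |= s[i];

	/* One test at the end: is any byte's high bit set? */
	return (acc & UINT64_C(0x8080808080808080)) == 0;
}
```

A 16-byte SSE2 version would follow the same shape, accumulating with a vector OR and extracting the high bits once at the end.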

[POC] verifying UTF-8 using SIMD instructions

2021-02-01 Thread John Naylor
Hi, As of b80e10638e3, there is a new API for validating the encoding of strings, and one of the side effects is that we have a wider choice of algorithms. For UTF-8, it has been demonstrated that SIMD is much faster at decoding [1] and validation [2] than the standard approach we use. It makes s