Hi,

As of b80e10638e3, there is a new API for validating the encoding of strings, and one of the side effects is that we have a wider choice of algorithms. For UTF-8, it has been demonstrated that SIMD is much faster at decoding [1] and validation [2] than the standard approach we use.

It makes sense to start with the ASCII subset of UTF-8 for a couple of reasons. First, ASCII is very widespread in database content, particularly in bulk loads. Second, ASCII can be validated using the simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and I'm guessing we can detect that at compile time and not mess with runtime checks. The examples above using SSE for the general case are much more complicated and involve SSE 4.2 or AVX.

Here are some numbers on my laptop (MacOS/clang 10 -- if the concept is okay, I'll do Linux/gcc and add more inputs). The test is the same as Heikki shared in [3], but I added a case with >95% Chinese characters just to show how that compares to the mixed ASCII/multibyte case.

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

patch:

 chinese | mixed | ascii
---------+-------+-------
    1103 |   498 |    51

The speedup in the pure ASCII case is nice (366 -> 51, roughly 7x). The mixed case improves by about a third, and the Chinese case is about 2% slower.

In the attached POC, I just have a pro forma portability stub, and left full portability detection for later. The fast path is inlined inside pg_utf8_verifystr(). I imagine the ASCII fast path could be abstracted into a separate function that is passed a function pointer for full encoding validation. That would allow other encodings with strict ASCII subsets to use this as well, but coding that abstraction might be a little messy, and b80e10638e3 already gives a performance boost over PG13.

I also gave a shot at doing full UTF-8 recognition using a DFA, but so far that has made performance worse. If I ever have more success with that, I'll add that to the mix.

[1] https://woboq.com/blog/utf-8-processing-using-simd.html
[2] https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
[3] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5...@iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com

diff --git a/src/common/wchar.c b/src/common/wchar.c
index 6e7d731e02..12b3a5e1a2 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -13,6 +13,10 @@
 #include "c.h"
 
 #include "mb/pg_wchar.h"
+#include "port/pg_bitutils.h"
+
+/* FIXME -- should go in src/include/port */
+#include <emmintrin.h>
 
 
 /*
@@ -1762,6 +1766,80 @@ pg_utf8_verifystr(const unsigned char *s, int len)
 {
 	const unsigned char *start = s;
 
+#ifdef __x86_64__
+
+
+	const __m128i	zero = _mm_setzero_si128();
+	__m128i			chunk,
+					cmp;
+
+	const int		chunk_size = sizeof(__m128i);
+	int				zero_mask,
+					highbit_mask,
+					ascii_count,
+					remainder;
+
+	while (len >= chunk_size)
+	{
+		/* load next chunk */
+		chunk = _mm_loadu_si128((const __m128i *) s);
+
+		/* first detect any zero bytes */
+		cmp = _mm_cmpeq_epi8(chunk, zero);
+		zero_mask = _mm_movemask_epi8(cmp);
+
+		/* if there is a zero byte, let the slow path encounter it */
+		if (zero_mask)
+			break;
+
+		/* now check for non-ascii bytes */
+		highbit_mask = _mm_movemask_epi8(chunk);
+
+		if (!highbit_mask)
+		{
+			/* all ascii, so advance to the next chunk */
+			s += chunk_size;
+			len -= chunk_size;
+			continue;
+		}
+
+		/*
+		 * if not all ascii, maybe there is a solid block of ascii
+		 * at the beginning of the chunk. if so, skip it
+		 */
+		ascii_count = pg_rightmost_one_pos32(highbit_mask);
+
+		s += ascii_count;
+		len -= ascii_count;
+		remainder = chunk_size - ascii_count;
+
+		/* found non-ascii, so handle the remainder in the normal way */
+		while (remainder > 0)
+		{
+			int			l;
+
+			/*
+			 * fast path for ASCII-subset characters
+			 * we already know they're non-zero
+			 */
+			if (!IS_HIGHBIT_SET(*s))
+				l = 1;
+			else
+			{
+				l = pg_utf8_verifychar(s, len);
+				if (l == -1)
+					goto finish;
+			}
+			s += l;
+			len -= l;
+			remainder -= l;
+
+		}
+	}
+
+#endif							/* __x86_64__ */
+
+
+	/* handle last few bytes */
 	while (len > 0)
 	{
 		int			l;
@@ -1770,19 +1848,20 @@ pg_utf8_verifystr(const unsigned char *s, int len)
 		if (!IS_HIGHBIT_SET(*s))
 		{
 			if (*s == '\0')
-				break;
+				goto finish;
 			l = 1;
 		}
 		else
 		{
 			l = pg_utf8_verifychar(s, len);
 			if (l == -1)
-				break;
+				goto finish;
 		}
 		s += l;
 		len -= l;
 	}
 
+finish:
 	return s - start;
 }
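
For anyone who wants to experiment outside the tree, here is a minimal standalone sketch of the abstraction mentioned in the mail: the ASCII fast path lives in one function, and a function pointer supplies the per-character check for the full encoding. It is not the attached patch. The names verifychar_fn, verifystr_ascii_fast() and toy_verifychar() are made up for illustration, the toy callback accepts only two-byte sequences rather than full UTF-8, and the SIMD part is guarded by __SSE2__ (one guess at how compile-time detection might look) with a plain scalar fallback. Unlike the patch, it does not skip the leading ASCII run of a mixed chunk with pg_rightmost_one_pos32(); it simply walks that chunk one character at a time.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __SSE2__
#include <emmintrin.h>
#endif

/*
 * Hypothetical callback type, shaped like pg_utf8_verifychar(): returns the
 * byte length of the character starting at s, or -1 if it is invalid.
 */
typedef int (*verifychar_fn) (const unsigned char *s, int len);

/* Toy callback for demonstration only: accepts two-byte sequences, nothing else. */
static int
toy_verifychar(const unsigned char *s, int len)
{
    if (len >= 2 && s[0] >= 0xC2 && s[0] <= 0xDF && (s[1] & 0xC0) == 0x80)
        return 2;
    return -1;
}

/*
 * Hypothetical helper: returns the number of leading bytes of s that are
 * valid, stopping at the first zero byte or invalid character.
 */
static int
verifystr_ascii_fast(const unsigned char *s, int len, verifychar_fn verifychar)
{
    const unsigned char *start = s;

    assert(len >= 0 && verifychar != NULL);

    while (len > 0)
    {
        int budget = len;       /* bytes to walk one character at a time */

#ifdef __SSE2__
        if (len >= 16)
        {
            const __m128i chunk = _mm_loadu_si128((const __m128i *) s);
            const __m128i iszero = _mm_cmpeq_epi8(chunk, _mm_setzero_si128());

            /* one mask bit per byte: set if that byte is zero or >= 0x80 */
            if ((_mm_movemask_epi8(iszero) | _mm_movemask_epi8(chunk)) == 0)
            {
                /* 16 bytes of non-zero ASCII, skip them all at once */
                s += 16;
                len -= 16;
                continue;
            }
            /* something interesting in this chunk: walk just these 16 bytes */
            budget = 16;
        }
#endif

        while (budget > 0)
        {
            int l;

            if (*s == '\0')
                return (int) (s - start);
            if ((*s & 0x80) == 0)
                l = 1;
            else
            {
                l = verifychar(s, len);
                if (l == -1)
                    return (int) (s - start);
            }
            s += l;
            len -= l;
            budget -= l;
        }
    }
    return (int) (s - start);
}
```

Any encoding with a strict ASCII subset could in principle pass its own per-character verifier in place of toy_verifychar(). Whether the indirect call costs enough to matter on mixed input would need measuring.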