[POC] verifying UTF-8 using SIMD instructions

John Naylor Mon, 01 Feb 2021 09:32:52 -0800

Hi,

As of b80e10638e3, there is a new API for validating the encoding of
strings, and one of the side effects is that we have a wider choice of
algorithms. For UTF-8, it has been demonstrated that SIMD is much faster at
decoding [1] and validation [2] than the standard approach we use.


It makes sense to start with the ascii subset of UTF-8 for a couple of
reasons. First, ascii is very widespread in database content, particularly
in bulk loads. Second, ascii can be validated using the simple SSE2
intrinsics that come with (I believe) any x86-64 chip, and I'm guessing we
can detect that at compile time and not mess with runtime checks. The
examples in [1] and [2] that use SSE for the general case are much more
complicated and involve SSE 4.2 or AVX.
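
To make the ascii-only idea concrete outside of the patch, here is a
minimal standalone sketch. The helper name ascii_prefix_len() and the
HAVE_SSE2_SKETCH macro are made up for illustration and are not part of the
POC. It assumes the compiler defines __SSE2__ (gcc/clang) or _M_X64 (MSVC)
when SSE2 is available, and falls back to a plain byte loop otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Compile-time detection only; no runtime CPU checks (an assumption). */
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_SKETCH 1
#endif

/*
 * Hypothetical helper, not from the patch: return the number of leading
 * bytes of s that are nonzero ascii.
 */
static size_t
ascii_prefix_len(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	assert(s != NULL || len == 0);

#ifdef HAVE_SSE2_SKETCH
	{
		const __m128i zero = _mm_setzero_si128();

		while (len - i >= sizeof(__m128i))
		{
			/* unaligned load of the next 16 bytes */
			__m128i		chunk = _mm_loadu_si128((const __m128i *) (s + i));

			/* bit n is set if byte n is zero or has its high bit set */
			int			bad = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero)) |
				_mm_movemask_epi8(chunk);

			if (bad)
				break;			/* let the byte loop find the exact spot */
			i += sizeof(__m128i);
		}
	}
#endif

	/* byte loop: handles the tail, and everything when SSE2 is absent */
	while (i < len && s[i] != 0 && s[i] < 0x80)
		i++;

	return i;
}
```

A caller would validate whatever follows the returned offset with the
regular multibyte code.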

Here are some numbers on my laptop (macOS/clang 10 -- if the concept is
okay, I'll do Linux/gcc and add more inputs). The test is the same as
Heikki shared in [3], but I added a case with >95% Chinese characters just
to show how that compares to the mixed ascii/multibyte case.

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

patch:

 chinese | mixed | ascii
---------+-------+-------
    1103 |   498 |    51

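The speedup in the pure ascii case is nice: about 7x (366 -> 51). The mixed
case improves by roughly a third (761 -> 498), and the mostly-Chinese case
is within about 2% of master (1081 -> 1103).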

In the attached POC, I just have a pro forma portability stub, and left
full portability detection for later. The fast path is inlined inside
pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted into
a separate function that takes a function pointer for full encoding
validation. That would allow other encodings with strict ascii subsets to
use this as well, but coding that abstraction might be a little messy, and
b80e10638e3 already gives a performance boost over PG13.
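
As a rough sketch of what that abstraction could look like (all names here
are hypothetical and do not exist in the tree): the generic function skips
runs of nonzero ascii itself and calls back into an encoding-specific
verifier only for bytes with the high bit set. The ascii skip is a plain
byte loop standing in for the SIMD code, and toy_verifychar() is a
deliberately incomplete verifier that only accepts two-byte UTF-8
sequences, just to exercise the callback.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical callback type: verify one multibyte character at s with len
 * bytes remaining; return its length in bytes, or -1 if invalid.
 */
typedef int (*mb_verifychar_fn) (const unsigned char *s, int len);

/* Returns the number of bytes verified, stopping at the first bad byte. */
static int
verifystr_with_ascii_fastpath(const unsigned char *s, int len,
							  mb_verifychar_fn verifychar)
{
	const unsigned char *start = s;

	assert(verifychar != NULL);

	while (len > 0)
	{
		int			l;

		/* stand-in for the SIMD skip: consume nonzero ascii bytes */
		while (len > 0 && *s != '\0' && !(*s & 0x80))
		{
			s++;
			len--;
		}

		/* done, or hit a zero byte: stop here */
		if (len == 0 || *s == '\0')
			break;

		/* high bit set: defer to the encoding-specific verifier */
		l = verifychar(s, len);
		if (l == -1)
			break;
		s += l;
		len -= l;
	}

	return (int) (s - start);
}

/* Toy verifier for illustration: accepts only two-byte UTF-8 sequences. */
static int
toy_verifychar(const unsigned char *s, int len)
{
	if (len >= 2 && s[0] >= 0xC2 && s[0] <= 0xDF && (s[1] & 0xC0) == 0x80)
		return 2;
	return -1;
}
```

Whether the indirect call costs enough to matter on multibyte-heavy input
is something only measurement would show.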

I also took a shot at doing full UTF-8 recognition using a DFA, but so far
that has made performance worse. If I ever have more success with that,
I'll add that to the mix.
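
For readers unfamiliar with the DFA approach, here is a small illustrative
state machine. It is not the code I experimented with, and the names are
invented for this sketch. It uses a switch rather than a lookup table, and
unlike pg_utf8_verifystr() it treats a zero byte as ordinary ascii. The
states follow the well-formed byte ranges of UTF-8: no overlong forms, no
surrogates, nothing above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* States: accept, N continuation bytes pending, restricted second bytes */
enum
{
	S_ACC, S_C1, S_C2, S_C3, S_E0, S_ED, S_F0, S_F4, S_REJ
};

#define IN_RANGE(b, lo, hi) ((b) >= (lo) && (b) <= (hi))

/* One DFA transition: current state plus input byte gives next state. */
static int
utf8_dfa_step(int state, unsigned char b)
{
	switch (state)
	{
		case S_ACC:
			if (b < 0x80)
				return S_ACC;	/* ascii */
			if (IN_RANGE(b, 0xC2, 0xDF))
				return S_C1;	/* two-byte lead */
			if (b == 0xE0)
				return S_E0;	/* must exclude overlong forms */
			if (b == 0xED)
				return S_ED;	/* must exclude surrogates */
			if (IN_RANGE(b, 0xE1, 0xEF))
				return S_C2;	/* other three-byte leads */
			if (b == 0xF0)
				return S_F0;	/* must exclude overlong forms */
			if (IN_RANGE(b, 0xF1, 0xF3))
				return S_C3;	/* other four-byte leads */
			if (b == 0xF4)
				return S_F4;	/* must stay at or below U+10FFFF */
			return S_REJ;
		case S_C1:
			return IN_RANGE(b, 0x80, 0xBF) ? S_ACC : S_REJ;
		case S_C2:
			return IN_RANGE(b, 0x80, 0xBF) ? S_C1 : S_REJ;
		case S_C3:
			return IN_RANGE(b, 0x80, 0xBF) ? S_C2 : S_REJ;
		case S_E0:
			return IN_RANGE(b, 0xA0, 0xBF) ? S_C1 : S_REJ;
		case S_ED:
			return IN_RANGE(b, 0x80, 0x9F) ? S_C1 : S_REJ;
		case S_F0:
			return IN_RANGE(b, 0x90, 0xBF) ? S_C2 : S_REJ;
		case S_F4:
			return IN_RANGE(b, 0x80, 0x8F) ? S_C2 : S_REJ;
		default:
			return S_REJ;
	}
}

/* Returns 1 if the whole buffer is well-formed UTF-8, else 0. */
static int
utf8_dfa_valid(const unsigned char *s, size_t len)
{
	int			state = S_ACC;

	assert(s != NULL || len == 0);

	for (size_t i = 0; i < len && state != S_REJ; i++)
		state = utf8_dfa_step(state, s[i]);

	/* ending mid-character is also invalid */
	return state == S_ACC;
}
```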

[1] https://woboq.com/blog/utf-8-processing-using-simd.html
[2] https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
[3] https://www.postgresql.org/message-id/[email protected]

-- 
John Naylor
EDB: http://www.enterprisedb.com

diff --git a/src/common/wchar.c b/src/common/wchar.c
index 6e7d731e02..12b3a5e1a2 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -13,6 +13,10 @@
 #include "c.h"
 
 #include "mb/pg_wchar.h"
+#include "port/pg_bitutils.h"
+
+/* FIXME -- should go in src/include/port */
+#include <emmintrin.h>
 
 
 /*
@@ -1762,6 +1766,80 @@ pg_utf8_verifystr(const unsigned char *s, int len)
 {
 	const unsigned char *start = s;
 
+#ifdef __x86_64__
+
+
+	const __m128i	zero = _mm_setzero_si128();
+	__m128i			chunk,
+					cmp;
+
+	const int		chunk_size = sizeof(__m128i);
+	int				zero_mask,
+					highbit_mask,
+					ascii_count,
+					remainder;
+
+	while (len >= chunk_size)
+	{
+		/* load next chunk */
+		chunk = _mm_loadu_si128((const __m128i *) s);
+
+		/* first detect any zero bytes */
+		cmp = _mm_cmpeq_epi8(chunk, zero);
+		zero_mask = _mm_movemask_epi8(cmp);
+
+		/* if there is a zero byte, let the slow path encounter it */
+		if (zero_mask)
+			break;
+
+		/* now check for non-ascii bytes */
+		highbit_mask = _mm_movemask_epi8(chunk);
+
+		if (!highbit_mask)
+		{
+			/* all ascii, so advance to the next chunk */
+			s += chunk_size;
+			len -= chunk_size;
+			continue;
+		}
+
+		/*
+		 * if not all ascii, maybe there is a solid block of ascii
+		 * at the beginning of the chunk. if so, skip it
+		 */
+		ascii_count = pg_rightmost_one_pos32(highbit_mask);
+
+		s += ascii_count;
+		len -= ascii_count;
+		remainder = chunk_size - ascii_count;
+
+		/* found non-ascii, so handle the remainder in the normal way */
+		while (remainder > 0)
+		{
+			int			l;
+
+			/*
+			 * fast path for ASCII-subset characters
+			 * we already know they're non-zero
+			 */
+			if (!IS_HIGHBIT_SET(*s))
+				l = 1;
+			else
+			{
+				l = pg_utf8_verifychar(s, len);
+				if (l == -1)
+					goto finish;
+			}
+			s += l;
+			len -= l;
+			remainder -= l;
+
+		}
+	}
+
+#endif							/* __x86_64__ */
+
+	/* handle last few bytes */
 	while (len > 0)
 	{
 		int			l;
@@ -1770,19 +1848,20 @@ pg_utf8_verifystr(const unsigned char *s, int len)
 		if (!IS_HIGHBIT_SET(*s))
 		{
 			if (*s == '\0')
-				break;
+				goto finish;
 			l = 1;
 		}
 		else
 		{
 			l = pg_utf8_verifychar(s, len);
 			if (l == -1)
-				break;
+				goto finish;
 		}
 		s += l;
 		len -= l;
 	}
 
+finish:
 	return s - start;
 }
