Re: [POC] verifying UTF-8 using SIMD instructions

John Naylor Thu, 18 Feb 2021 16:43:41 -0800

On Mon, Feb 15, 2021 at 9:32 PM John Naylor <john.nay...@enterprisedb.com>
wrote:
>
> On Mon, Feb 15, 2021 at 9:18 AM Heikki Linnakangas <hlinn...@iki.fi>
wrote:
> >
> > I'm guessing that's because the unaligned access in check_ascii() is
> > expensive on this platform.


> Some possible remedies:

> 3) #ifdef out the ascii check for 32-bit platforms.

> 4) Same as the non-UTF8 case -- only check for ascii 8 bytes at a time.
I'll probably try this first.

I've attached a couple patches to try on top of v4; maybe they'll help the
Arm32 regression. 01 reduces the stride to 8 bytes, and 02 applies on top
of v1 to disable the fallback fast path entirely on 32-bit platforms. A bit
of a heavy hammer, but it'll confirm (or not) your theory about unaligned
loads.

Also, I've included patches to explain more fully how I modeled non-UTF-8
performance while still using the UTF-8 tests. I think it was a useful
thing to do, and I have a theory that might predict how a non-UTF8 encoding
will perform with the fast path.

03A and 03B are independent of each other and conflict, but both apply on
top of v4 (don't need 02). Both replace the v4 fallback with the ascii
fastpath + pg_utf8_verifychar() in the loop, similar to utf-8 on master.
03A has a local static copy of pg_utf8_islegal(), and 03B uses the existing
global function. (On x86, you can disable SSE4 by passing
USE_FALLBACK_UTF8=1 to configure.)

While Clang 10 regressed for me on pure multibyte in a similar test
upthread, on Linux gcc 8.4 there isn't a regression at all. IIRC, gcc
wasn't as good as Clang when the API changed a few weeks ago, so its
regression from v4 is still faster than master. Clang only regressed with
my changes because it somehow handled master much better to begin with.

x86-64 Linux gcc 8.4

master

 chinese | mixed | ascii
---------+-------+-------
    1453 |   857 |   428

v4 (fallback verifier written as a single function)

 chinese | mixed | ascii
---------+-------+-------
     815 |   514 |    82

v4 plus addendum 03A -- emulate non-utf-8 using a copy of
pg_utf8_is_legal() as a static function

 chinese | mixed | ascii
---------+-------+-------
    1115 |   547 |    87

v4 plus addendum 03B -- emulate non-utf-8 using pg_utf8_is_legal() as a
global function

 chinese | mixed | ascii
---------+-------+-------
    1279 |   604 |    82

(I also tried the same on ppc64le Linux, gcc 4.8.5 and while not great, it
never got worse than master either on pure multibyte.)

This is supposed to model the performance of a non-utf8 encoding, where we
don't have a bespoke function written from scratch. Here's my theory: If an
encoding has pg_*_mblen(), a global function, inside pg_*_verifychar(), it
seems it won't benefit as much from an ascii fast path as one whose
pg_*_verifychar() has no function calls. I'm not sure whether a compiler
can inline a global function's body into call sites in the unit where it's
defined. (I haven't looked at the assembly.) But recall that you didn't
commit 0002 from the earlier encoding change, because it wasn't performing.
I looked at that patch again, and while it inlined the pg_utf8_verifychar()
call, it still called the global function pg_utf8_islegal().

If the above is anything to go by, on gcc at least, I don't think we need
to worry about a regression when adding an ascii fast path to non-utf-8
multibyte encodings.

Regarding SSE, I've added an ascii fast path in my local branch, but it's
not going to be as big a difference because 1) the check is more expensive
in terms of branches than the C case, and 2) because the general case is so
fast already, it's hard to improve upon. I just need to do some testing and
cleanup on the whole thing, and that'll be ready to share.

--
John Naylor
EDB: http://www.enterprisedb.com

diff --git a/src/include/port/pg_utf8.h b/src/include/port/pg_utf8.h
index dac2afc130..a0c94dd4f3 100644
--- a/src/include/port/pg_utf8.h
+++ b/src/include/port/pg_utf8.h
@@ -53,26 +53,25 @@ extern int pg_validate_utf8_fallback(const unsigned char *s, int len);
 static inline int
 check_ascii(const unsigned char *s, int len)
 {
-	uint64		half1, half2,
+	uint64		chunk,
 				highbit_mask;
 
-	if  (len >= 2 * sizeof(uint64))
+	if  (len >= sizeof(uint64))
 	{
-		memcpy(&half1, s, sizeof(uint64));
-		memcpy(&half2, s + sizeof(uint64), sizeof(uint64));
+		memcpy(&chunk, s, sizeof(uint64));
 
 		/*
 		 * If there are any zero bytes, bail and let the slow
 		 * path handle it.
 		 */
-		if (HAS_ZERO(half1) || HAS_ZERO(half2))
+		if (HAS_ZERO(chunk))
 			return 0;
 
 		/* Check if any bytes in this chunk have the high bit set. */
-		highbit_mask = ((half1 | half2) & UINT64CONST(0x8080808080808080));
+		highbit_mask = (chunk & UINT64CONST(0x8080808080808080));
 
 		if (!highbit_mask)
-			return 2 * sizeof(uint64);
+			return sizeof(uint64);
 		else
 			return 0;
 	}

diff --git a/src/include/port/pg_utf8.h b/src/include/port/pg_utf8.h
index a0c94dd4f3..62272a7c93 100644
--- a/src/include/port/pg_utf8.h
+++ b/src/include/port/pg_utf8.h
@@ -53,6 +53,7 @@ extern int pg_validate_utf8_fallback(const unsigned char *s, int len);
 static inline int
 check_ascii(const unsigned char *s, int len)
 {
+#if SIZEOF_VOID_P >= 8
 	uint64		chunk,
 				highbit_mask;
 
@@ -75,7 +76,9 @@ check_ascii(const unsigned char *s, int len)
 		else
 			return 0;
 	}
-
+	else
+		return 0;
+#endif
 	return 0;
 }

diff --git a/src/port/pg_utf8_fallback.c b/src/port/pg_utf8_fallback.c
index 1615c48233..af1d636331 100644
--- a/src/port/pg_utf8_fallback.c
+++ b/src/port/pg_utf8_fallback.c
@@ -17,8 +17,91 @@
 
 #include "port/pg_utf8.h"
 
+static bool
+pg_utf8_islegal(const unsigned char *source, int length)
+{
+	unsigned char a;
+
+	switch (length)
+	{
+		default:
+			/* reject lengths 5 and 6 for now */
+			return false;
+		case 4:
+			a = source[3];
+			if (a < 0x80 || a > 0xBF)
+				return false;
+			/* FALL THRU */
+		case 3:
+			a = source[2];
+			if (a < 0x80 || a > 0xBF)
+				return false;
+			/* FALL THRU */
+		case 2:
+			a = source[1];
+			switch (*source)
+			{
+				case 0xE0:
+					if (a < 0xA0 || a > 0xBF)
+						return false;
+					break;
+				case 0xED:
+					if (a < 0x80 || a > 0x9F)
+						return false;
+					break;
+				case 0xF0:
+					if (a < 0x90 || a > 0xBF)
+						return false;
+					break;
+				case 0xF4:
+					if (a < 0x80 || a > 0x8F)
+						return false;
+					break;
+				default:
+					if (a < 0x80 || a > 0xBF)
+						return false;
+					break;
+			}
+			/* FALL THRU */
+		case 1:
+			a = *source;
+			if (a >= 0x80 && a < 0xC2)
+				return false;
+			if (a > 0xF4)
+				return false;
+			break;
+	}
+	return true;
+}
+
+static int
+pg_utf8_verifychar(const unsigned char *s, int len)
+{
+	int			l;
 
-#define IS_CONTINUATION_BYTE(c) (((c) & 0b11000000) == 0b10000000)
+	if ((*s & 0x80) == 0)
+	{
+		if (*s == '\0')
+			return -1;
+		return 1;
+	}
+	else if ((*s & 0xe0) == 0xc0)
+		l = 2;
+	else if ((*s & 0xf0) == 0xe0)
+		l = 3;
+	else if ((*s & 0xf8) == 0xf0)
+		l = 4;
+	else
+		l = 1;
+
+	if (l > len)
+		return -1;
+
+	if (!pg_utf8_islegal(s, l))
+		return -1;
+
+	return l;
+}
 
 /*
  * See the comment in common/wchar.c under "multibyte sequence validators".
@@ -27,7 +110,6 @@ int
 pg_validate_utf8_fallback(const unsigned char *s, int len)
 {
 	const unsigned char *start = s;
-	unsigned char b1, b2, b3, b4;
 
 	while (len > 0)
 	{
@@ -49,81 +131,12 @@ pg_validate_utf8_fallback(const unsigned char *s, int len)
 				break;
 			l = 1;
 		}
-		/* code points U+0080 through U+07FF */
-		else if ((*s & 0b11100000) == 0b11000000)
-		{
-			l = 2;
-			if (len < l)
-				break;
-
-			b1 = *s;
-			b2 = *(s + 1);
-
-			if (!IS_CONTINUATION_BYTE(b2))
-				break;
-
-			/* check 2-byte overlong: 1100.000x.10xx.xxxx */
-			if (b1 < 0xC2)
-				break;
-		}
-		/* code points U+0800 through U+D7FF and U+E000 through U+FFFF */
-		else if ((*s & 0b11110000) == 0b11100000)
-		{
-			l = 3;
-			if (len < l)
-				break;
-
-			b1 = *s;
-			b2 = *(s + 1);
-			b3 = *(s + 2);
-
-			if (!IS_CONTINUATION_BYTE(b2) ||
-				!IS_CONTINUATION_BYTE(b3))
-				break;
-
-			/* check 3-byte overlong: 1110.0000 1001.xxxx 10xx.xxxx */
-			if (b1 == 0xE0 && b2 < 0xA0)
-				break;
-
-			/* check surrogate: 1110.1101 101x.xxxx 10xx.xxxx */
-			if (b1 == 0xED && b2 > 0x9F)
-				break;
-		}
-		/* code points U+010000 through U+10FFFF */
-		else if ((*s & 0b11111000) == 0b11110000)
+		else
 		{
-			l = 4;
-			if (len < l)
-				break;
-
-			b1 = *s;
-			b2 = *(s + 1);
-			b3 = *(s + 2);
-			b4 = *(s + 3);
-
-			if (!IS_CONTINUATION_BYTE(b2) ||
-				!IS_CONTINUATION_BYTE(b3) ||
-				!IS_CONTINUATION_BYTE(b4))
-				break;
-
-			/*
-			 * check 4-byte overlong:
-			 * 1111.0000 1000.xxxx 10xx.xxxx 10xx.xxxx
-			 */
-			if (b1 == 0xF0 && b2 < 0x90)
-				break;
-
-			/*
-			 * check too large:
-			 * 1111.0100 1001.xxxx 10xx.xxxx 10xx.xxxx
-			 */
-			if ((b1 == 0xF4 && b2 > 0x8F) || b1 > 0xF4)
+			l = pg_utf8_verifychar(s, len);
+			if (l == -1)
 				break;
 		}
-		else
-			/* invalid byte */
-			break;
-
 		s += l;
 		len -= l;
 	}

diff --git a/src/port/pg_utf8_fallback.c b/src/port/pg_utf8_fallback.c
index 1615c48233..355f44085b 100644
--- a/src/port/pg_utf8_fallback.c
+++ b/src/port/pg_utf8_fallback.c
@@ -16,9 +16,36 @@
 #include "c.h"
 
 #include "port/pg_utf8.h"
+#include "mb/pg_wchar.h"
 
+static int
+pg_utf8_verifychar(const unsigned char *s, int len)
+{
+	int			l;
 
-#define IS_CONTINUATION_BYTE(c) (((c) & 0b11000000) == 0b10000000)
+	if ((*s & 0x80) == 0)
+	{
+		if (*s == '\0')
+			return -1;
+		return 1;
+	}
+	else if ((*s & 0xe0) == 0xc0)
+		l = 2;
+	else if ((*s & 0xf0) == 0xe0)
+		l = 3;
+	else if ((*s & 0xf8) == 0xf0)
+		l = 4;
+	else
+		l = 1;
+
+	if (l > len)
+		return -1;
+
+	if (!pg_utf8_islegal(s, l))
+		return -1;
+
+	return l;
+}
 
 /*
  * See the comment in common/wchar.c under "multibyte sequence validators".
@@ -27,7 +54,6 @@ int
 pg_validate_utf8_fallback(const unsigned char *s, int len)
 {
 	const unsigned char *start = s;
-	unsigned char b1, b2, b3, b4;
 
 	while (len > 0)
 	{
@@ -49,81 +75,12 @@ pg_validate_utf8_fallback(const unsigned char *s, int len)
 				break;
 			l = 1;
 		}
-		/* code points U+0080 through U+07FF */
-		else if ((*s & 0b11100000) == 0b11000000)
-		{
-			l = 2;
-			if (len < l)
-				break;
-
-			b1 = *s;
-			b2 = *(s + 1);
-
-			if (!IS_CONTINUATION_BYTE(b2))
-				break;
-
-			/* check 2-byte overlong: 1100.000x.10xx.xxxx */
-			if (b1 < 0xC2)
-				break;
-		}
-		/* code points U+0800 through U+D7FF and U+E000 through U+FFFF */
-		else if ((*s & 0b11110000) == 0b11100000)
-		{
-			l = 3;
-			if (len < l)
-				break;
-
-			b1 = *s;
-			b2 = *(s + 1);
-			b3 = *(s + 2);
-
-			if (!IS_CONTINUATION_BYTE(b2) ||
-				!IS_CONTINUATION_BYTE(b3))
-				break;
-
-			/* check 3-byte overlong: 1110.0000 1001.xxxx 10xx.xxxx */
-			if (b1 == 0xE0 && b2 < 0xA0)
-				break;
-
-			/* check surrogate: 1110.1101 101x.xxxx 10xx.xxxx */
-			if (b1 == 0xED && b2 > 0x9F)
-				break;
-		}
-		/* code points U+010000 through U+10FFFF */
-		else if ((*s & 0b11111000) == 0b11110000)
+		else
 		{
-			l = 4;
-			if (len < l)
-				break;
-
-			b1 = *s;
-			b2 = *(s + 1);
-			b3 = *(s + 2);
-			b4 = *(s + 3);
-
-			if (!IS_CONTINUATION_BYTE(b2) ||
-				!IS_CONTINUATION_BYTE(b3) ||
-				!IS_CONTINUATION_BYTE(b4))
-				break;
-
-			/*
-			 * check 4-byte overlong:
-			 * 1111.0000 1000.xxxx 10xx.xxxx 10xx.xxxx
-			 */
-			if (b1 == 0xF0 && b2 < 0x90)
-				break;
-
-			/*
-			 * check too large:
-			 * 1111.0100 1001.xxxx 10xx.xxxx 10xx.xxxx
-			 */
-			if ((b1 == 0xF4 && b2 > 0x8F) || b1 > 0xF4)
+			l = pg_utf8_verifychar(s, len);
+			if (l == -1)
 				break;
 		}
-		else
-			/* invalid byte */
-			break;
-
 		s += l;
 		len -= l;
 	}

Re: [POC] verifying UTF-8 using SIMD instructions

Reply via email to