Hi,

On Wed, Nov 06, 2024 at 12:16:33PM +1300, David Rowley wrote:
> On Wed, 6 Nov 2024 at 04:03, Bertrand Drouvot
> <bertranddrouvot...@gmail.com> wrote:
> > Another option could be to use SIMD instructions to check multiple bytes
> > is zero in a single operation. Maybe just an idea to keep in mind and 
> > experiment
> > if we feel the need later on.
> 
> Could do. I just wrote it that way to give the compiler flexibility to
> do SIMD implicitly.

ohhh, great, thanks!

> That seemed easier than messing around with SIMD
> intrinsics.

I had in mind to use SIMD intrinsics actually when posting the SIMD idea but...

> I guess the compiler won't use SIMD with the single
> size_t-at-a-time version as it can't be certain it's ok to access the
> memory beyond the first zero word. Because I wrote the "if" condition
> using bitwise-OR, there's no boolean short-circuiting, so the compiler
> sees it must be safe to access all the memory for the loop iteration.

that's a better idea! Yeah, I think that now the compiler sees that all 
comparisons
can be done in parallel and combined with a single OR operation (so, good 
candidate
to use SIMD optimization).

> If I use -march=native or -march=znver2 on my Zen2 machine, gcc does
> use SIMD operators.  Clang uses some 128-bit registers without
> specifying -march:
> 
> drowley@amd3990x:~$ gcc -O2 allzeros.c -march=native -o allzeros &&
> for i in {1..3}; do ./allzeros; done
> char: done in 1940539 nanoseconds
> size_t: done in 261731 nanoseconds (7.41425 times faster than char)
> size_t * 4: done in 130415 nanoseconds (14.8797 times faster than char)
> size_t * 8: done in 70031 nanoseconds (27.7097 times faster than char)
> char: done in 3030132 nanoseconds
> size_t: done in 477044 nanoseconds (6.35189 times faster than char)
> size_t * 4: done in 123551 nanoseconds (24.5254 times faster than char)
> size_t * 8: done in 68549 nanoseconds (44.2039 times faster than char)
> char: done in 3214037 nanoseconds
> size_t: done in 256901 nanoseconds (12.5108 times faster than char)
> size_t * 4: done in 126017 nanoseconds (25.5048 times faster than char)
> size_t * 8: done in 73167 nanoseconds (43.9274 times faster than char)
> 

Thanks for the tests! Out of curiosity, using gcc 11.4.0 (SIMD instructions not
generated) and get:

$ gcc -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 2655385 nanoseconds
size_t: done in 476021 nanoseconds (5.57829 times faster than char)
size_t SIMD DAVID: done in 174816 nanoseconds (15.1896 times faster than char)

or

$ gcc -march=native -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 2681146 nanoseconds
size_t: done in 395041 nanoseconds (6.78701 times faster than char)
size_t SIMD DAVID: done in 175608 nanoseconds (15.2678 times faster than char)

=> It's faster than the size_t one.

But of course, it's even faster with SIMD:

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -O2 allzeros_simd.c -o allzeros_simd ; 
./allzeros_simd
char: done in 5318674 nanoseconds
size_t: done in 443591 nanoseconds (11.99 times faster than char)
size_t SIMD DAVID: done in 179650 nanoseconds (29.6058 times faster than char)

or

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_simd.c -o 
allzeros_simd ; ./allzeros_simd
char: done in 5319534 nanoseconds
size_t: done in 426599 nanoseconds (12.4696 times faster than char)
size_t SIMD DAVID: done in 128687 nanoseconds (41.337 times faster than char)

So, I don't see any reason why not to use this SIMD approach: please find v7
attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From c609eeb1ee333018a107b15f9c3a1a699a9f661d Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Date: Fri, 1 Nov 2024 11:46:29 +0000
Subject: [PATCH v7] Optimize pg_memory_is_all_zeros()

pg_memory_is_all_zeros() is currently doing byte per byte comparison and so
could lead to performance regression or penalties when multi bytes comparison
could be done instead.

Let's provide an optimized version that divides the checks into three phases for
efficiency:

- Initial alignment (byte per byte comparison)
- Multiple bytes comparison at once (candidate for SIMD optimization)
- Remaining bytes (byte per byte comparison)

Code mainly provided by David Rowley.

In passing, let's make use of this new version in PageIsVerifiedExtended() now
that multi bytes comparison is handled.
---
 src/backend/storage/page/bufpage.c | 13 +--------
 src/include/utils/memutils.h       | 43 +++++++++++++++++++++++++++---
 2 files changed, 41 insertions(+), 15 deletions(-)
  15.1% src/backend/storage/page/
  84.8% src/include/utils/

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index be6f1f62d2..aa264f61b9 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -89,10 +89,8 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
 {
 	PageHeader	p = (PageHeader) page;
 	size_t	   *pagebytes;
-	int			i;
 	bool		checksum_failure = false;
 	bool		header_sane = false;
-	bool		all_zeroes = false;
 	uint16		checksum = 0;
 
 	/*
@@ -126,18 +124,9 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
 	}
 
 	/* Check all-zeroes case */
-	all_zeroes = true;
 	pagebytes = (size_t *) page;
-	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
-	{
-		if (pagebytes[i] != 0)
-		{
-			all_zeroes = false;
-			break;
-		}
-	}
 
-	if (all_zeroes)
+	if (pg_memory_is_all_zeros(pagebytes, BLCKSZ))
 		return true;
 
 	/*
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index 3590c8bad9..8bce3c6141 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -192,17 +192,54 @@ extern MemoryContext BumpContextCreate(MemoryContext parent,
 /*
  * Test if a memory region starting at "ptr" and of size "len" is full of
  * zeroes.
+ *
+ * The test is divided into three phases for efficiency:
+ *  - Initial alignment (byte per byte comparison)
+ *  - Multiple bytes comparison at once
+ *  - Remaining bytes (byte per byte comparison)
+ *
+ * Caller must ensure that "ptr" is non-null.
  */
 static inline bool
 pg_memory_is_all_zeros(const void *ptr, size_t len)
 {
-	const char *p = (const char *) ptr;
+	const unsigned char *p = (const unsigned char *) ptr;
+	const unsigned char *end = &p[len];
+	const unsigned char *aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1)));
+
+	/* Compare bytes, byte by byte, until the pointer "p" is aligned */
+	while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
+	{
+		if (p == end)
+			return true;
+
+		if (*p++ != 0)
+			return false;
+	}
+
+	/*
+	 * Multiple bytes comparison(s) at once.
+	 *
+	 * The logic here is that all comparisons can be done in parallel and
+	 * combined with a single OR operation, making it a good candidate for
+	 * SIMD optimization.
+	 */
+	for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
+	{
+		if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
+			(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
+			(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
+			(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
+			return false;
+	}
 
-	for (size_t i = 0; i < len; i++)
+	/* Compare remaining bytes, byte by byte */
+	while (p < end)
 	{
-		if (p[i] != 0)
+		if (*p++ != 0)
 			return false;
 	}
+
 	return true;
 }
 
-- 
2.34.1

Reply via email to