Hi, On Mon, Nov 11, 2024 at 06:19:50AM +0000, Bertrand Drouvot wrote: > Hi, > > On Sat, Nov 09, 2024 at 04:15:04AM +0000, Bertrand Drouvot wrote: > > Hi, > > > > On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote: > > > I tried with [1] and the > > > latest gcc does not seem to be smart enough to figure this out. I > > > tried adding some additional len checks that the compiler can use as a > > > cue and won't need to emit code for the checks providing the function > > > does get inlined. That was enough to get the compiler to not emit the > > > loops when they'll not be used. See the -DCHECK_LEN flag I'm passing > > > in the 2nd compiler window. I just don't know if putting something > > > like that into the code is a good idea as if the function wasn't > > > inlined for some reason, the extra len checks would have to be > > > compiled into the function. > > > > > > David > > > > > > [1] https://godbolt.org/z/xa81ro8GK > > > > Looking at it, that looks like an issue. > > > > I mean, without the -DCHECK_LEN flag then the SIMD code will read up to 48 > > bytes > > beyond the struct's memory (which is 16 bytes): > > > > This is fine: > > " > > movdqu xmm0, XMMWORD PTR [rdi] > > " > > > > But I don't think it is: > > > > " > > movdqu xmm2, XMMWORD PTR [rdi+16] > > movdqu xmm1, XMMWORD PTR [rdi+32] > > movdqu xmm3, XMMWORD PTR [rdi+48] > > " > > > > given that the struct size is only 16 bytes. > > > > Thoughts? > > What about "simply" starting pg_memory_is_all_zeros() with? > > " > if (len < sizeof(size_t)*8) { > while (p < end) { > if (*p++ != 0) > return false; > } > return true; > } > " > > That way: > > - we prevents reading beyond the memory area in the SIMD section (if < 64 > bytes) > - we make sure that aligned_end can not be after the real end (could be if the > len is < 8 bytes) > - there is no need for additional size checks later in the code > - len < 64 bytes will be read byte per byte but that's likely "enough" (if not > faster) for those "small" sizes >
To handle the "what about the len check if the function is not inlined?", I can't think about a good approach. I thought about using a macro like: " #define pg_memory_is_all_zeros(ptr, len) \ ((len) < sizeof(size_t) * 8 ? \ pg_memory_is_all_zeros_small((ptr), (len)) : \ pg_memory_is_all_zeros_large((ptr), (len))) " but that would not help (as the len check would need to be done too, so same run time cost). I also thought about using __builtin_constant_p(len) but that would not help because still we need the len check for safety. So please find attached v11 that uses "only" one len check into the function to ensure that we won't read beyond the memory area (should its size be < 64 or < 8). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
>From 200e7752335d0f537f82a27ce8425fad8309a9fb Mon Sep 17 00:00:00 2001 From: Michael Paquier <mich...@paquier.xyz> Date: Fri, 8 Nov 2024 14:19:59 +0900 Subject: [PATCH v11 1/2] Optimize pg_memory_is_all_zeros() pg_memory_is_all_zeros() is currently doing byte per byte comparison and so could lead to performance regression or penalties when multi bytes comparison could be done instead. Let's provide an optimized version that divides the checks into four phases for efficiency: - Initial alignment (byte per byte comparison) - Compare 8 size_t chunks at once using bitwise OR (candidate for SIMD optimization) - Compare remaining size_t aligned chunks - Compare remaining bytes (byte per byte comparison) If the memory area size is < 64 bytes then we are using byte per byte comparison only to ensure that no data beyond the memory area could be read (that's likely to be good enough for such sizes). Code mainly suggested by David Rowley. --- src/include/utils/memutils.h | 85 ++++++++++++++++++++++++++++++++++-- 1 file changed, 82 insertions(+), 3 deletions(-) 100.0% src/include/utils/ diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h index 3590c8bad9..f10b0ea05e 100644 --- a/src/include/utils/memutils.h +++ b/src/include/utils/memutils.h @@ -190,19 +190,98 @@ extern MemoryContext BumpContextCreate(MemoryContext parent, #define SLAB_LARGE_BLOCK_SIZE (8 * 1024 * 1024) /* + * pg_memory_is_all_zeros + * * Test if a memory region starting at "ptr" and of size "len" is full of * zeroes. + * + * The test is divided into multiple phases, to be efficient for various + * length values: + * - Byte by byte comparison if len < 64 to ensure that we won't read beyond the + * memory area. + * - Byte by byte comparison, until the pointer is aligned. + * - 8 * sizeof(size_t) comparisons using bitwise OR, to encourage compilers + * to use SIMD instructions if available, up to the last aligned location + * possible. + * - size_t comparisons, with aligned pointers, up to the last location + * possible. + * - Byte by byte comparison, until the end location. + * + * Caller must ensure that "ptr" is not NULL. */ static inline bool pg_memory_is_all_zeros(const void *ptr, size_t len) { - const char *p = (const char *) ptr; + const unsigned char *p = (const unsigned char *) ptr; + const unsigned char *end = &p[len]; + const unsigned char *aligned_end = (const unsigned char *) + ((uintptr_t) end & (~(sizeof(size_t) - 1))); + + /* + * For len < 64, compare byte per byte to ensure we'll not read beyond the + * memory area. + */ + if (len < sizeof(size_t) * 8) + { + while (p < end) + { + if (*p++ != 0) + return false; + } + return true; + } + + /* Compare bytes until the pointer "p" is aligned */ + while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) + { + if (p == end) + return true; + + if (*p++ != 0) + return false; + } + + /* + * Compare 8 * sizeof(size_t) chunks at once. + * + * For performance reasons, we manually unroll this loop and purposefully + * use bitwise-ORs to combine each comparison. This prevents boolean + * short-circuiting and lets the compiler know that it's safe to access + * all 8 elements regardless of the result of the other comparisons. This + * seems to be enough to coax a few compilers into using SIMD + * instructions. + * + * There is no risk to read beyond the memory area thanks to the len < 64 + * check done below. + */ + for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8) + { + if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) | + (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) | + (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) | + (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0)) + return false; + } - for (size_t i = 0; i < len; i++) + /* + * Compare remaining size_t-aligned chunks. + * + * aligned_end cant' be > end as we ensured to take care of len < 8 (in + * the len < 64 check below). So, no risk to read beyond the memory area. + */ + for (; p < aligned_end; p += sizeof(size_t)) { - if (p[i] != 0) + if (*(size_t *) p != 0) return false; } + + /* Compare remaining bytes until the end */ + while (p < end) + { + if (*p++ != 0) + return false; + } + return true; } -- 2.34.1
>From 1bd0cda4709d08827d44423040a4ba607466532e Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <bertranddrouvot...@gmail.com> Date: Fri, 8 Nov 2024 17:19:43 +0000 Subject: [PATCH v11 2/2] Make use of pg_memory_is_all_zeros() in PageIsVerifiedExtended() Now that pg_memory_is_all_zeros() is optimized to handle multi bytes comparison (instead of byte per byte), let's make use of it in PageIsVerifiedExtended(). --- src/backend/storage/page/bufpage.c | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-) 100.0% src/backend/storage/page/ diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c index be6f1f62d2..aa264f61b9 100644 --- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c @@ -89,10 +89,8 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags) { PageHeader p = (PageHeader) page; size_t *pagebytes; - int i; bool checksum_failure = false; bool header_sane = false; - bool all_zeroes = false; uint16 checksum = 0; /* @@ -126,18 +124,9 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags) } /* Check all-zeroes case */ - all_zeroes = true; pagebytes = (size_t *) page; - for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++) - { - if (pagebytes[i] != 0) - { - all_zeroes = false; - break; - } - } - if (all_zeroes) + if (pg_memory_is_all_zeros(pagebytes, BLCKSZ)) return true; /* -- 2.34.1