The attached patch (which I haven't installed) simplifies stdc_memreverse8 a bit. I imagine it might also make it a tad faster in some cases on x86-64 with GCC 15, as the compiler generates one fewer conditional branch in the function prologue. Also, though I doubt it matters, the tight loop is 2 bytes shorter.

Since the attached patch follows draft C2y almost word for word, and the patched code is therefore a bit simpler to verify, is there some reason why it shouldn't be applied? Maybe it's really slow on some other platform? If so, a comment to that effect would be helpful.
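
For anyone who wants to check the equivalence without applying the patch, here is a minimal standalone sketch that places the two loop shapes side by side. The names memreverse8_two_index, memreverse8_half_count and variants_agree are hypothetical and exist only for this illustration; the loop bodies are copied from the removed and added lines of the diff, and nothing here is part of Gnulib.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name: the loop that the patch removes.
   Two indices walk toward each other; the N > 0 guard keeps
   n - 1 from wrapping around when N is zero.  */
static void
memreverse8_two_index (size_t n, unsigned char *ptr)
{
  if (n > 0)
    {
      size_t i, j;
      for (i = 0, j = n - 1; i < j; i++, j--)
        {
          unsigned char xi = ptr[i];
          unsigned char xj = ptr[j];
          ptr[j] = xi;
          ptr[i] = xj;
        }
    }
}

/* Hypothetical name: the loop that the patch adds.
   A single index runs over the first half; no guard is needed
   because n / 2 is zero when N is 0 or 1.  */
static void
memreverse8_half_count (size_t n, unsigned char *ptr)
{
  for (size_t i = 0; i < n / 2; i++)
    {
      unsigned char xi = ptr[i];
      unsigned char xj = ptr[n - i - 1];
      ptr[n - i - 1] = xi;
      ptr[i] = xj;
    }
}

/* Illustrative helper: return 1 if both variants turn an N-byte
   test pattern into its exact reverse, 0 otherwise.
   N must not exceed 64.  */
static int
variants_agree (size_t n)
{
  unsigned char a[64], b[64], expected[64];
  if (n > sizeof a)
    return 0;
  for (size_t k = 0; k < n; k++)
    {
      a[k] = b[k] = (unsigned char) (k * 37 + 1);
      expected[n - 1 - k] = a[k];
    }
  memreverse8_two_index (n, a);
  memreverse8_half_count (n, b);
  return memcmp (a, expected, n) == 0 && memcmp (b, expected, n) == 0;
}
```

Calling variants_agree for every length from 0 through 64 covers the empty case, odd lengths (where the middle byte stays put), and even lengths. It says nothing about generated code quality, which still has to be judged from the compiler output.
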
From d2cc68d8bb5aed53ce4ac3babe14e5cb4810fb55 Mon Sep 17 00:00:00 2001
From: Paul Eggert <[email protected]>
Date: Mon, 16 Mar 2026 16:34:49 -0700
Subject: [PATCH] stdc_memreverse8: simplify and slightly tune
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* lib/stdbit.in.h (stdc_memreverse8):
Use code that closely mimics draft C2y.
On x86-64 with gcc 15 -O2, this generates slightly-better code
for the non-inlined version, and doesn’t seem to hurt inlining.
---
 ChangeLog       |  6 ++++++
 lib/stdbit.in.h | 24 ++++++++++--------------
 2 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 1032063da7..342176863c 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,11 @@
 2026-03-16  Paul Eggert  <[email protected]>
 
+	stdc_memreverse8: simplify and slightly tune
+	* lib/stdbit.in.h (stdc_memreverse8):
+	Use code that closely mimics draft C2y.
+	On x86-64 with gcc 15 -O2, this generates slightly-better code
+	for the non-inlined version, and doesn’t seem to hurt inlining.
+
 	stdbit-h: don’t generate some dummy .o files
 	On recent GNU and other C23ish platforms, do not compile files
 	like lib/stdc_bit_ceil.c, as the corresponding .o files contain
diff --git a/lib/stdbit.in.h b/lib/stdbit.in.h
index 8b61300041..b8c99d0247 100644
--- a/lib/stdbit.in.h
+++ b/lib/stdbit.in.h
@@ -1307,21 +1307,17 @@ stdc_rotate_right_ull (unsigned long long int v, unsigned int c)
 _GL_STDC_MEMREVERSE8_INLINE void
 stdc_memreverse8 (size_t n, unsigned char *ptr)
 {
-  if (n > 0)
+  /* There is no need to optimize the cases N == 1, N == 2, N == 4
+     specially using __builtin_constant_p, because GCC does the possible
+     optimizations already, taking into account the alignment of PTR:
+     GCC >= 3 for N == 1, GCC >= 8 for N == 2, GCC >= 13 for N == 4.
+     (Whereas clang >= 3, <= 22 optimizes only the case N == 1.)  */
+  for (size_t i = 0; i < n / 2; i++)
     {
-      /* There is no need to optimize the cases N == 1, N == 2, N == 4
-         specially using __builtin_constant_p, because GCC does the possible
-         optimizations already, taking into account the alignment of PTR:
-         GCC >= 3 for N == 1, GCC >= 8 for N == 2, GCC >= 13 for N == 4.
-         (Whereas clang >= 3, <= 22 optimizes only the case N == 1.)  */
-      size_t i, j;
-      for (i = 0, j = n-1; i < j; i++, j--)
-        {
-          unsigned char xi = ptr[i];
-          unsigned char xj = ptr[j];
-          ptr[j] = xi;
-          ptr[i] = xj;
-        }
+      unsigned char xi = ptr[i];
+      unsigned char xj = ptr[n - i - 1];
+      ptr[n - i - 1] = xi;
+      ptr[i] = xj;
     }
 }
 
-- 
2.51.0
