On 2026-03-13 13:40, Bruno Haible wrote:
> your simplified code does *NOT* generate the same code. If you came to
> this impression, you probably must have tested x86_64, x86, arm64 only.
Yes, to be honest I tested only x86-64. Ouch: I am surprised RISC-V is so poorly optimized. Alpha and sparc64 are of course lower priority, but if we need to handle RISC-V anyway we might as well do them too.
I attempted to do this by installing the attached patch. It restores the optimization by using plain 'if' rather than '#if', as 'if' is typically better than '#if' when either will do.
>> These patches also remove casts that aren't needed (some of which
>> confused me a bit).
>
> The casts to signed intN_t types were there for clarity.
For me they made the code less clear. For example, there's no point in casting int_least64_t to uint64_t just before converting to uint_least64_t, as uint64_t cannot possibly have fewer bits than uint_least64_t. So I was puzzled as to why that cast was there.
Also, this part of stdbit.h is specified without exact-width types. So it's disconcerting to see unnecessary use of exact-width types in the Gnulib implementation; it's not clear why exact-width types are helpful in code where the API stays away from them. (C2y allows the exact-width types to not exist at all; of course that's mostly theoretical for Gnulib.)
> The casts to uint_fast16_t were there for speed. On some architectures,
> such as sparc, it is more efficient to work with 32-bit integers than
> with 16-bit integers, and the definition of uint_fast16_t as uint32_t
> embodies this knowledge. Removing these casts is also a de-optimization
> that no one has asked for.
I don't understand the deoptimization there, as on all GNU targets the values are promoted to at least 32 bits even without the casts. But it's not a big deal, so I put the casts back in the attached patch. It makes the code a bit more symmetric even if it is a bit more casty.
From b70f85852a7478ef7999967dcc3e48507da98668 Mon Sep 17 00:00:00 2001
From: Paul Eggert <[email protected]>
Date: Fri, 13 Mar 2026 17:44:06 -0700
Subject: [PATCH] stdbit-h: restore optimization for RISC-V etc
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Problem reported by Bruno Haible in:
https://lists.gnu.org/r/bug-gnulib/2026-03/msg00098.html
This patch does the optimization in a different way,
preferring plain ‘if’ to ‘#if’ if either will do.
* lib/stdbit.in.h: Bring back includes for byteswap.h, string.h.
(_GL_STDBIT_OPTIMIZE_VIA_MEMCPY, _GL_STDBIT_ASSUME_ALIGNED)
(_GL_STDBIT_BIGENDIAN): New macros.
(stdc_load8_beu16, stdc_load8_leu16):
Bring back casts to uint_fast16_t.
(stdc_load8_aligned_beu16, stdc_load8_aligned_beu32)
(stdc_load8_aligned_beu64, stdc_load8_aligned_leu16)
(stdc_load8_aligned_leu32, stdc_load8_aligned_leu64)
(stdc_load8_aligned_bes8, stdc_load8_aligned_bes16)
(stdc_load8_aligned_bes32, stdc_load8_aligned_bes64)
(stdc_load8_aligned_les8, stdc_load8_aligned_les16)
(stdc_load8_aligned_les32, stdc_load8_aligned_les64)
(stdc_store8_aligned_beu16, stdc_store8_aligned_beu32)
(stdc_store8_aligned_beu64, stdc_store8_aligned_leu16)
(stdc_store8_aligned_leu32, stdc_store8_aligned_leu64)
(stdc_store8_aligned_bes8, stdc_store8_aligned_bes16)
(stdc_store8_aligned_bes32, stdc_store8_aligned_bes64)
(stdc_store8_aligned_les8, stdc_store8_aligned_les16)
(stdc_store8_aligned_les32, stdc_store8_aligned_les64):
Bring back the memcpy optimization if _GL_STDBIT_OPTIMIZE_VIA_MEMCPY.
* modules/stdc_load8_aligned (Depends-on):
* modules/stdc_store8_aligned (Depends-on):
Go back to depending on byteswap.
---
 ChangeLog                   |  29 ++++
 lib/stdbit.in.h             | 288 ++++++++++++++++++++++++++++++++----
 modules/stdc_load8_aligned  |   1 +
 modules/stdc_store8_aligned |   1 +
 4 files changed, 289 insertions(+), 30 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 8d1e8c2829..3caee4a011 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,34 @@
 2026-03-13  Paul Eggert  <[email protected]>
 
+        stdbit-h: restore optimization for RISC-V etc
+        Problem reported by Bruno Haible in:
+        https://lists.gnu.org/r/bug-gnulib/2026-03/msg00098.html
+        This patch does the optimization in a different way,
+        preferring plain ‘if’ to ‘#if’ if either will do.
+        * lib/stdbit.in.h: Bring back includes for byteswap.h, string.h.
+        (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY, _GL_STDBIT_ASSUME_ALIGNED)
+        (_GL_STDBIT_BIGENDIAN): New macros.
+        (stdc_load8_beu16, stdc_load8_leu16):
+        Bring back casts to uint_fast16_t.
+        (stdc_load8_aligned_beu16, stdc_load8_aligned_beu32)
+        (stdc_load8_aligned_beu64, stdc_load8_aligned_leu16)
+        (stdc_load8_aligned_leu32, stdc_load8_aligned_leu64)
+        (stdc_load8_aligned_bes8, stdc_load8_aligned_bes16)
+        (stdc_load8_aligned_bes32, stdc_load8_aligned_bes64)
+        (stdc_load8_aligned_les8, stdc_load8_aligned_les16)
+        (stdc_load8_aligned_les32, stdc_load8_aligned_les64)
+        (stdc_store8_aligned_beu16, stdc_store8_aligned_beu32)
+        (stdc_store8_aligned_beu64, stdc_store8_aligned_leu16)
+        (stdc_store8_aligned_leu32, stdc_store8_aligned_leu64)
+        (stdc_store8_aligned_bes8, stdc_store8_aligned_bes16)
+        (stdc_store8_aligned_bes32, stdc_store8_aligned_bes64)
+        (stdc_store8_aligned_les8, stdc_store8_aligned_les16)
+        (stdc_store8_aligned_les32, stdc_store8_aligned_les64):
+        Bring back the memcpy optimization if _GL_STDBIT_OPTIMIZE_VIA_MEMCPY.
+        * modules/stdc_load8_aligned (Depends-on):
+        * modules/stdc_store8_aligned (Depends-on):
+        Go back to depending on byteswap.
+
         stdbit-h: fewer casts
         * lib/stdbit.in.h (stdc_load8_beu16, stdc_load8_leu16)
         (stdc_store8_beu16, stdc_store8_beu32, stdc_store8_beu64)
diff --git a/lib/stdbit.in.h b/lib/stdbit.in.h
index 63c1db25ad..893e1cce71 100644
--- a/lib/stdbit.in.h
+++ b/lib/stdbit.in.h
@@ -45,6 +45,16 @@
 #endif
 
+#if @GNULIB_STDC_LOAD8_ALIGNED@ || @GNULIB_STDC_STORE8_ALIGNED@
+
+/* Get bswap_16, bswap_32, bswap_64.  */
+# include <byteswap.h>
+
+/* Get memcpy.  */
+# include <string.h>
+
+#endif
+
 _GL_INLINE_HEADER_BEGIN
 
 #ifndef _GL_STDBIT_INLINE
@@ -1108,6 +1118,128 @@ stdc_bit_ceil_ull (unsigned long long int n)
 
 /* ISO C2y § 7.18.21 Endian-Aware 8-Bit Load */
 
+/* On hosts where _GL_STDBIT_OPTIMIZE_VIA_MEMCPY might be useful,
+   we need to avoid type-punning, because the compiler's aliasing
+   analysis would frequently produce incorrect code, and requiring the
+   option '-fno-strict-aliasing' is no viable solution.
+   So, this definition won't work:
+
+     uint16_t
+     load16 (const unsigned char ptr[2])
+     {
+       return *(const uint16_t *)ptr;
+     }
+
+   Instead, the following definitions are candidates:
+
+     // Trick from Lasse Collin: use memcpy and __builtin_assume_aligned.
+     uint16_t
+     load16_a (const unsigned char ptr[2])
+     {
+       uint16_t value;
+       memcpy (&value, __builtin_assume_aligned (ptr, 2), 2);
+       return value;
+     }
+
+     // Use __builtin_assume_aligned, without memcpy.
+     uint16_t
+     load16_b (const unsigned char ptr[2])
+     {
+       const unsigned char *aptr =
+         (const unsigned char *) __builtin_assume_aligned (ptr, 2);
+     #if WORDS_BIGENDIAN
+       return ((uint16_t) aptr [0] << 8) | (uint16_t) aptr [1];
+     #else
+       return (uint16_t) aptr [0] | ((uint16_t) aptr [1] << 8);
+     #endif
+     }
+
+     // Use memcpy and __assume.
+     uint16_t
+     load16_c (const unsigned char ptr[2])
+     {
+       __assume (((uintptr_t) ptr & (2 - 1)) == 0);
+       uint16_t value;
+       memcpy (&value, __builtin_assume_aligned (ptr, 2), 2);
+       return value;
+     }
+
+     // Use __assume, without memcpy.
+     uint16_t
+     load16_d (const unsigned char ptr[2])
+     {
+       __assume (((uintptr_t) ptr & (2 - 1)) == 0);
+     #if WORDS_BIGENDIAN
+       return ((uint16_t) ptr [0] << 8) | (uint16_t) ptr [1];
+     #else
+       return (uint16_t) ptr [0] | ((uint16_t) ptr [1] << 8);
+     #endif
+     }
+
+     // Use memcpy, without __builtin_assume_aligned or __assume.
+     uint16_t
+     load16_e (const unsigned char ptr[2])
+     {
+       uint16_t value;
+       memcpy (&value, ptr, 2);
+       return value;
+     }
+
+     // Use the code for the unaligned case.
+     uint16_t
+     load16_f (const unsigned char ptr[2])
+     {
+     #if WORDS_BIGENDIAN
+       return ((uint16_t) ptr [0] << 8) | (uint16_t) ptr [1];
+     #else
+       return (uint16_t) ptr [0] | ((uint16_t) ptr [1] << 8);
+     #endif
+     }
+
+   Portability constraints:
+   - __builtin_assume_aligned works only in GCC >= 4.7 and clang >= 4.
+   - __assume works only with MSVC (_MSC_VER >= 1200).
+
+   Which variant produces the best code?
+   - memcpy is inlined only in gcc >= 3.4, g++ >= 4.9, clang >= 4.
+   - MSVC's __assume has no effect.
+   - With gcc 13:
+     On armelhf, arm64, i686, powerpc, powerpc64, powerpc64le, s390x, x86_64:
+       All of a,b,e,f are equally good.
+     On alpha, arm, hppa, mips, mips64, riscv64, sh4, sparc64:
+       Only a,b are good; f medium; e worst.
+   - With older gcc versions on x86_64:
+     gcc >= 10: All of a,b,e,f are equally good.
+     gcc < 10: Only a,e are good; b,f medium.
+   - With MSVC 14: Only c,e are good; d,f medium.
+
+   So, we use the following heuristic for getting good code:
+   - gcc >= 4.7, g++ >= 4.9, clang >= 4: Use variant a.
+   - MSVC: Use variant e.
+   - Otherwise: Use variant f.
+ */
+#if (defined __clang__ ? __clang_major__ >= 4 : \
+     (defined __GNUC__ \
+      && (defined __cplusplus \
+          ? __GNUC__ + (__GNUC_MINOR__ >= 9) > 4 \
+          : __GNUC__ + (__GNUC_MINOR__ >= 7) > 4)))
+# define _GL_STDBIT_ASSUME_ALIGNED(ptr, align) \
+    __builtin_assume_aligned (ptr, align)
+# define _GL_STDBIT_OPTIMIZE_VIA_MEMCPY 1
+#elif defined _MSC_VER
+# define _GL_STDBIT_OPTIMIZE_VIA_MEMCPY 1
+#endif
+
+#ifndef _GL_STDBIT_ASSUME_ALIGNED
+# define _GL_STDBIT_ASSUME_ALIGNED(ptr, align) (ptr)
+#endif
+
+#ifndef _GL_STDBIT_OPTIMIZE_VIA_MEMCPY
+# define _GL_STDBIT_OPTIMIZE_VIA_MEMCPY 0
+#endif
+
+#define _GL_STDBIT_BIGENDIAN (__STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_BIG__)
+
 #if @GNULIB_STDC_LOAD8@
 
 _GL_STDC_LOAD8_INLINE uint_least8_t
@@ -1119,7 +1251,7 @@ stdc_load8_beu8 (const unsigned char ptr[1])
 _GL_STDC_LOAD8_INLINE uint_least16_t
 stdc_load8_beu16 (const unsigned char ptr[2])
 {
-  return (ptr[0] << 8) | ptr[1];
+  return ((uint_fast16_t) ptr[0] << 8) | (uint_fast16_t) ptr[1];
 }
 
 _GL_STDC_LOAD8_INLINE uint_least32_t
@@ -1147,7 +1279,7 @@ stdc_load8_leu8 (const unsigned char ptr[1])
 _GL_STDC_LOAD8_INLINE uint_least16_t
 stdc_load8_leu16 (const unsigned char ptr[2])
 {
-  return ptr[0] | (ptr[1] << 8);
+  return (uint_fast16_t) ptr[0] | ((uint_fast16_t) ptr[1] << 8);
 }
 
 _GL_STDC_LOAD8_INLINE uint_least32_t
@@ -1227,19 +1359,46 @@ stdc_load8_aligned_beu8 (const unsigned char ptr[1])
 _GL_STDC_LOAD8_ALIGNED_INLINE uint_least16_t
 stdc_load8_aligned_beu16 (const unsigned char ptr[2])
 {
-  return stdc_load8_beu16 (ptr);
+  if (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY)
+    {
+      uint_least16_t value;
+      memcpy (&value, _GL_STDBIT_ASSUME_ALIGNED (ptr, 2), 2);
+      if (!_GL_STDBIT_BIGENDIAN)
+        value = bswap_16 (value);
+      return value;
+    }
+  else
+    return stdc_load8_beu16 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE uint_least32_t
 stdc_load8_aligned_beu32 (const unsigned char ptr[4])
 {
-  return stdc_load8_beu32 (ptr);
+  if (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY)
+    {
+      uint_least32_t value;
+      memcpy (&value, _GL_STDBIT_ASSUME_ALIGNED (ptr, 4), 4);
+      if (!_GL_STDBIT_BIGENDIAN)
+        value = bswap_32 (value);
+      return value;
+    }
+  else
+    return stdc_load8_beu32 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE uint_least64_t
 stdc_load8_aligned_beu64 (const unsigned char ptr[8])
 {
-  return stdc_load8_beu64 (ptr);
+  if (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY)
+    {
+      uint_least64_t value;
+      memcpy (&value, _GL_STDBIT_ASSUME_ALIGNED (ptr, 8), 8);
+      if (!_GL_STDBIT_BIGENDIAN)
+        value = bswap_64 (value);
+      return value;
+    }
+  else
+    return stdc_load8_beu64 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE uint_least8_t
@@ -1251,67 +1410,94 @@ stdc_load8_aligned_leu8 (const unsigned char ptr[1])
 _GL_STDC_LOAD8_ALIGNED_INLINE uint_least16_t
 stdc_load8_aligned_leu16 (const unsigned char ptr[2])
 {
-  return stdc_load8_leu16 (ptr);
+  if (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY)
+    {
+      uint_least16_t value;
+      memcpy (&value, _GL_STDBIT_ASSUME_ALIGNED (ptr, 2), 2);
+      if (_GL_STDBIT_BIGENDIAN)
+        value = bswap_16 (value);
+      return value;
+    }
+  else
+    return stdc_load8_leu16 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE uint_least32_t
 stdc_load8_aligned_leu32 (const unsigned char ptr[4])
 {
-  return stdc_load8_leu32 (ptr);
+  if (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY)
+    {
+      uint_least32_t value;
+      memcpy (&value, _GL_STDBIT_ASSUME_ALIGNED (ptr, 4), 4);
+      if (_GL_STDBIT_BIGENDIAN)
+        value = bswap_32 (value);
+      return value;
+    }
+  else
+    return stdc_load8_leu32 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE uint_least64_t
 stdc_load8_aligned_leu64 (const unsigned char ptr[8])
 {
-  return stdc_load8_leu64 (ptr);
+  if (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY)
+    {
+      uint_least64_t value;
+      memcpy (&value, _GL_STDBIT_ASSUME_ALIGNED (ptr, 8), 8);
+      if (_GL_STDBIT_BIGENDIAN)
+        value = bswap_64 (value);
+      return value;
+    }
+  else
+    return stdc_load8_leu64 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE int_least8_t
 stdc_load8_aligned_bes8 (const unsigned char ptr[1])
 {
-  return stdc_load8_bes8 (ptr);
+  return stdc_load8_aligned_beu8 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE int_least16_t
 stdc_load8_aligned_bes16 (const unsigned char ptr[2])
 {
-  return stdc_load8_bes16 (ptr);
+  return stdc_load8_aligned_beu16 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE int_least32_t
 stdc_load8_aligned_bes32 (const unsigned char ptr[4])
 {
-  return stdc_load8_bes32 (ptr);
+  return stdc_load8_aligned_beu32 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE int_least64_t
 stdc_load8_aligned_bes64 (const unsigned char ptr[8])
 {
-  return stdc_load8_bes64 (ptr);
+  return stdc_load8_aligned_beu64 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE int_least8_t
 stdc_load8_aligned_les8 (const unsigned char ptr[1])
 {
-  return stdc_load8_les8 (ptr);
+  return stdc_load8_aligned_leu8 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE int_least16_t
 stdc_load8_aligned_les16 (const unsigned char ptr[2])
 {
-  return stdc_load8_les16 (ptr);
+  return stdc_load8_aligned_leu16 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE int_least32_t
 stdc_load8_aligned_les32 (const unsigned char ptr[4])
 {
-  return stdc_load8_les32 (ptr);
+  return stdc_load8_aligned_leu32 (ptr);
 }
 
 _GL_STDC_LOAD8_ALIGNED_INLINE int_least64_t
 stdc_load8_aligned_les64 (const unsigned char ptr[8])
 {
-  return stdc_load8_les64 (ptr);
+  return stdc_load8_aligned_leu64 (ptr);
 }
 
 #endif
 
@@ -1452,19 +1638,40 @@ stdc_store8_aligned_beu8 (uint_least8_t value, unsigned char ptr[1])
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_beu16 (uint_least16_t value, unsigned char ptr[2])
 {
-  stdc_store8_beu16 (value, ptr);
+  if (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY)
+    {
+      if (!_GL_STDBIT_BIGENDIAN)
+        value = bswap_16 (value);
+      memcpy (_GL_STDBIT_ASSUME_ALIGNED (ptr, 2), &value, 2);
+    }
+  else
+    stdc_store8_beu16 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_beu32 (uint_least32_t value, unsigned char ptr[4])
 {
-  stdc_store8_beu32 (value, ptr);
+  if (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY)
+    {
+      if (!_GL_STDBIT_BIGENDIAN)
+        value = bswap_32 (value);
+      memcpy (_GL_STDBIT_ASSUME_ALIGNED (ptr, 4), &value, 4);
+    }
+  else
+    stdc_store8_beu32 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_beu64 (uint_least64_t value, unsigned char ptr[8])
 {
-  stdc_store8_beu64 (value, ptr);
+  if (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY)
+    {
+      if (!_GL_STDBIT_BIGENDIAN)
+        value = bswap_64 (value);
+      memcpy (_GL_STDBIT_ASSUME_ALIGNED (ptr, 8), &value, 8);
+    }
+  else
+    stdc_store8_beu64 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
@@ -1476,67 +1683,88 @@ stdc_store8_aligned_leu8 (uint_least8_t value, unsigned char ptr[1])
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_leu16 (uint_least16_t value, unsigned char ptr[2])
 {
-  stdc_store8_leu16 (value, ptr);
+  if (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY)
+    {
+      if (_GL_STDBIT_BIGENDIAN)
+        value = bswap_16 (value);
+      memcpy (_GL_STDBIT_ASSUME_ALIGNED (ptr, 2), &value, 2);
+    }
+  else
+    stdc_store8_leu16 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_leu32 (uint_least32_t value, unsigned char ptr[4])
 {
-  stdc_store8_leu32 (value, ptr);
+  if (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY)
+    {
+      if (_GL_STDBIT_BIGENDIAN)
+        value = bswap_32 (value);
+      memcpy (_GL_STDBIT_ASSUME_ALIGNED (ptr, 4), &value, 4);
+    }
+  else
+    stdc_store8_leu32 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_leu64 (uint_least64_t value, unsigned char ptr[8])
 {
-  stdc_store8_leu64 (value, ptr);
+  if (_GL_STDBIT_OPTIMIZE_VIA_MEMCPY)
+    {
+      if (_GL_STDBIT_BIGENDIAN)
+        value = bswap_64 (value);
+      memcpy (_GL_STDBIT_ASSUME_ALIGNED (ptr, 8), &value, 8);
+    }
+  else
+    stdc_store8_leu64 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_bes8 (int_least8_t value, unsigned char ptr[1])
 {
-  stdc_store8_bes8 (value, ptr);
+  stdc_store8_aligned_beu8 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_bes16 (int_least16_t value, unsigned char ptr[2])
 {
-  stdc_store8_bes16 (value, ptr);
+  stdc_store8_aligned_beu16 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_bes32 (int_least32_t value, unsigned char ptr[4])
 {
-  stdc_store8_bes32 (value, ptr);
+  stdc_store8_aligned_beu32 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_bes64 (int_least64_t value, unsigned char ptr[8])
 {
-  stdc_store8_bes64 (value, ptr);
+  stdc_store8_aligned_beu64 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_les8 (int_least8_t value, unsigned char ptr[1])
 {
-  stdc_store8_les8 (value, ptr);
+  stdc_store8_aligned_leu8 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_les16 (int_least16_t value, unsigned char ptr[2])
 {
-  stdc_store8_les16 (value, ptr);
+  stdc_store8_aligned_leu16 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_les32 (int_least32_t value, unsigned char ptr[4])
 {
-  stdc_store8_les32 (value, ptr);
+  stdc_store8_aligned_leu32 (value, ptr);
 }
 
 _GL_STDC_STORE8_ALIGNED_INLINE void
 stdc_store8_aligned_les64 (int_least64_t value, unsigned char ptr[8])
 {
-  stdc_store8_les64 (value, ptr);
+  stdc_store8_aligned_leu64 (value, ptr);
 }
 
 #endif
diff --git a/modules/stdc_load8_aligned b/modules/stdc_load8_aligned
index 8539f55520..6a9cca24c1 100644
--- a/modules/stdc_load8_aligned
+++ b/modules/stdc_load8_aligned
@@ -8,6 +8,7 @@ Depends-on:
 stdbit-h
 stdc_load8
 stdint-h
+byteswap
 
 configure.ac:
 AC_REQUIRE([gl_STDBIT_H])
diff --git a/modules/stdc_store8_aligned b/modules/stdc_store8_aligned
index f23556e504..f80625afea 100644
--- a/modules/stdc_store8_aligned
+++ b/modules/stdc_store8_aligned
@@ -8,6 +8,7 @@ Depends-on:
 stdbit-h
 stdc_store8
 stdint-h
+byteswap
 
 configure.ac:
 AC_REQUIRE([gl_STDBIT_H])
-- 
2.51.0
