https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120264
Bug ID: 120264 Summary: Optimize modulus when the divisor is a small Mersenne number. Product: gcc Version: 15.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: cassio.neri at gmail dot com Target Milestone: --- This report is motivated by a discussion on getting the day of the week in libstdc++: https://gcc.gnu.org/pipermail/gcc-patches/2025-May/682664.html In many platforms and under -O2, gcc calculates n % D by evaluating: n - D * (M * n >> S) (built-in algorithm), where M and S are constants (depending on D only). If D is a Mersenne number, i.e. D = 2^p - 1 for some p > 0, there's an alternative algorithm that gets n % D by evaluating: n * M' >> S' (fast_mod algorithm). Here is an implementation for uint32_t operands. template <uint32_t D> uint32_t fast_mod(uint32_t n) { // Since 2^64 % D > 0, 2^64 / D == (2^64 - 1) / D; uint64_t constexpr M = uint64_t(-1) / D + 1; // == 2^64 / D + 1 uint32_t constexpr S = 64 - std::popcount(D); return uint32_t(M * uint64_t(n) >> S); } The case D = 7 is relevant for calculating the day of the week. In x86-64, codegen for for built-in and fast_mod are, respectively, unsigned int mod<7u>(unsigned int): mov edx, edi mov eax, edi imul rdx, rdx, 613566757 shr rdx, 32 sub eax, edx shr eax add eax, edx shr eax, 2 lea edx, [0+rax*8] sub edx, eax mov eax, edi sub eax, edx ret unsigned int fast_mod<7u>(unsigned int): movabs rdx, 2635249153387078803 mov eax, edi imul rax, rdx shr rax, 61 ret (More comparisons: https://godbolt.org/z/xaTTKoxen and https://godbolt.org/z/KKxoPYoME) The program below takes ~1.5min to run and shows that: 1. For D <= 2^16 - 1 = 65535, fast_mod<D> gives the right result for all n in [0, 2^32[. 2. For D >= 2^17 - 1 = 131071, fast_mod<D> gives the right result for all n in [0, N[, where N < 2^32 is printed on the screen. #include <bit> #include <cstdint> #include <iostream> template <uint32_t D> uint32_t fast_mod(uint32_t n) { // Since 2^64 % D > 0, 2^64 / D == (2^64 - 1) / D; uint64_t constexpr M = uint64_t(-1) / D + 1; // == 2^64 / D + 1 uint32_t constexpr S = 64 - std::popcount(D); return uint32_t(M * uint64_t(n) >> S); } template <uint32_t D = 3> void test() { std::cout << "Test for D = " << D << ": "; uint32_t n = 0; do { if (fast_mod<D>(n) != n % D) { std:: cout << "fail for n = " << n << ".\n"; break; } ++n; } while (n != 0); if (n == 0) std::cout << "pass.\n"; if constexpr (D != uint32_t(-1)) test<2 * D + 1>(); } int main() { test(); return 0; } Therefore, for all D <= 65535, in many systems gcc could use the fast_mod algorithm instead of the built-in one. Actually, codegen can get better if, instead of 64-bits, fast_mod uses 32-bit operations: template <uint32_t D> uint32_t fast_mod32(uint32_t n) { uint32_t constexpr M = uint32_t(-1) / D + 1; uint32_t constexpr S = 32 - std::popcount(D); return uint32_t(M * uint32_t(n) >> S); } For D == 7 we get: unsigned int fast_mod32<7u>(unsigned int): imul eax, edi, 613566757 shr eax, 29 ret Unfortunately, fast_mod32 has limited range (e.g. D = 7 it works on [0, 178956973[.) My second suggestion is using fast_mod32 provided that the user tells the compiler, through an [[assume]] attribute, that the dividend is within range: [[assume(n < 178956973)]]; uint32_t r = n % 7; // gcc could use fast_mod32 here. I reckon that this might be too niche and not worth doing for all divisors. However, the day of the week calculation is a strong use case for implementing fast_mod32 for D = 7. Alternatively, there could be a __builtin_fast_mod7 for usage in libstdc++. (Related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118072)