https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120264

            Bug ID: 120264
           Summary: Optimize modulus when the divisor is a small Mersenne
                    number.
           Product: gcc
           Version: 15.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cassio.neri at gmail dot com
  Target Milestone: ---

This report is motivated by a discussion on getting the day of the week in
libstdc++:
https://gcc.gnu.org/pipermail/gcc-patches/2025-May/682664.html

In many platforms and under -O2, gcc calculates n % D by evaluating:
  n - D * (M * n >> S)    (built-in algorithm),
where M and S are constants (depending on D only).

If D is a Mersenne number, i.e. D = 2^p - 1 for some p > 0, there's an
alternative algorithm that gets n % D by evaluating:
  n * M' >> S'             (fast_mod algorithm).

Here is an implementation for uint32_t operands.

template <uint32_t D>
uint32_t fast_mod(uint32_t n) {
  // Since 2^64 % D > 0, 2^64 / D == (2^64 - 1) / D;
  uint64_t constexpr M = uint64_t(-1) / D + 1; // == 2^64 / D + 1
  uint32_t constexpr S = 64 - std::popcount(D);
  return uint32_t(M * uint64_t(n) >> S);
}

The case D = 7 is relevant for calculating the day of the week. In x86-64,
codegen for for built-in and fast_mod are, respectively,

unsigned int mod<7u>(unsigned int):
  mov edx, edi
  mov eax, edi
  imul rdx, rdx, 613566757
  shr rdx, 32
  sub eax, edx
  shr eax
  add eax, edx
  shr eax, 2
  lea edx, [0+rax*8]
  sub edx, eax
  mov eax, edi
  sub eax, edx
  ret

unsigned int fast_mod<7u>(unsigned int):
  movabs rdx, 2635249153387078803
  mov eax, edi
  imul rax, rdx
  shr rax, 61
  ret

(More comparisons: https://godbolt.org/z/xaTTKoxen and
https://godbolt.org/z/KKxoPYoME)

The program below takes ~1.5min to run and shows that:

  1. For D <= 2^16 - 1 =  65535, fast_mod<D> gives the right result for all n
in [0, 2^32[.
  2. For D >= 2^17 - 1 = 131071, fast_mod<D> gives the right result for all n
in [0, N[, where N < 2^32 is printed on the screen.

#include <bit>
#include <cstdint>
#include <iostream>

template <uint32_t D>
uint32_t fast_mod(uint32_t n) {
  // Since 2^64 % D > 0, 2^64 / D == (2^64 - 1) / D;
  uint64_t constexpr M = uint64_t(-1) / D + 1; // == 2^64 / D + 1
  uint32_t constexpr S = 64 - std::popcount(D);
  return uint32_t(M * uint64_t(n) >> S);
}

template <uint32_t D = 3>
void test() {

    std::cout << "Test for D = " << D << ": ";

    uint32_t n = 0;
    do {
        if (fast_mod<D>(n) != n % D) {
            std:: cout << "fail for n = " << n << ".\n";
            break;
        }
        ++n;
    } while (n != 0);

    if (n == 0)
        std::cout << "pass.\n";

    if constexpr (D != uint32_t(-1))
      test<2 * D + 1>();
}

int main() {
  test();
  return 0;
}

Therefore, for all D <= 65535, in many systems gcc could use the fast_mod
algorithm instead of the built-in one.

Actually, codegen can get better if, instead of 64-bits, fast_mod uses 32-bit
operations:

template <uint32_t D>
uint32_t fast_mod32(uint32_t n) {
  uint32_t constexpr M = uint32_t(-1) / D + 1;
  uint32_t constexpr S = 32 - std::popcount(D);
  return uint32_t(M * uint32_t(n) >> S);
}

For D == 7 we get:

unsigned int fast_mod32<7u>(unsigned int):
  imul eax, edi, 613566757
  shr eax, 29
  ret

Unfortunately, fast_mod32 has limited range (e.g. D = 7 it works on [0,
178956973[.)

My second suggestion is using fast_mod32 provided that the user tells the
compiler, through an [[assume]] attribute, that the dividend is within range:

  [[assume(n < 178956973)]];
  uint32_t r = n % 7; // gcc could use fast_mod32 here.

I reckon that this might be too niche and not worth doing for all divisors.
However, the day of the week calculation is a strong use case for implementing
fast_mod32 for D = 7. Alternatively, there could be a __builtin_fast_mod7 for
usage in libstdc++.

(Related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118072)

Reply via email to