Hello Jakub,

thanks a lot for taking this on!

As far as I can tell, the patch works very well, and really speeds up
things.

As (somewhat confusingly) discussed in the PR, there are a couple of
things that could still be done incrementally with this method.

Fist, it can be combined with, or even used for, the calulation
of the quotient.  Let a be a number for which your patch works,
for example 5.

If you want to calculate n / 5, you can do

  rem = n % 5;  /* Fast, with your patch.  */
  quot = (n - rem) * magic;

in a fast way, where magic is the multiplicative inverse of 5
modulo 2^128 (for a 128-bit number) or 2^64 (for a 64-bit number),
which can be calculated using gmp_invert. The multiplicative inverse
for division works because n - rem is divisible by 5.

Second, you can also use this for the quotient and/or remainder
by 2*a (for example 10), with a slight adjustment:

  rem5 = n % 5;
  quot5 = (n - rem5) * magic;
  rem10 = (quot5 % 2) * 5 + rem5;
  quot10 = quot5 / 2;

This would cover the important use case of getting the quotient and
remainder for division by 10.

However, a benchmark (source attached) indicates that this method
is much faster even when only one of quotient and remainder
of division by 10 is needed.  Numbers I got indicate that this
method is faster by about a factor of five than calling the
library version.

I hope this clears up the somewhat confusing string of comments
in the PR.

Best regards

        Thomas
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <stdlib.h>

typedef __uint128_t mytype;

double this_time ()
{
  struct timeval tv;
  gettimeofday (&tv, NULL);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

#define ONE ((__uint128_t) 1)
#define TWO_64 (ONE << 64)

unsigned rem10_v1 (mytype n)
{
  return n % 10;
}

unsigned rem10_v2 (mytype n)
{
  mytype quot5, tmp;
  const mytype magic = (0xCCCCCCCCCCCCCCCC * TWO_64 + 0xCCCCCCCCCCCCCCCD * ONE);
  unsigned rem5 = n % 5;
  quot5 = (n - rem5) * magic;
  return rem5 + (quot5 % 2) * 5;
}

#define N 10000000

int main()
{
  mytype *a;
  unsigned long s;
  int fd;
  double t1, t2;

  a = malloc (sizeof (*a) * N);
  fd = open ("/dev/urandom", O_RDONLY);
  read (fd, a, sizeof (*a) * N);
  close (fd);

  s = 0;
  t1 = this_time();
  for (int i=0; i<N; i++)
    s += rem10_v1 (a[i]);
  t2 = this_time();
  printf ("s = %lu rem10_v1: %f s\n", s, (t2-t1));

  s = 0;
  t1 = this_time();
  for (int i=0; i<N; i++)
    s += rem10_v2 (a[i]);
  t2 = this_time();
  printf ("s = %lu rem10_v1: %f s\n", s, (t2-t1));
  
}

Reply via email to