Hello Jakub, thanks a lot for taking this on!
As far as I can tell, the patch works very well, and really speeds up things. As (somewhat confusingly) discussed in the PR, there are a couple of things that could still be done incrementally with this method. Fist, it can be combined with, or even used for, the calulation of the quotient. Let a be a number for which your patch works, for example 5. If you want to calculate n / 5, you can do rem = n % 5; /* Fast, with your patch. */ quot = (n - rem) * magic; in a fast way, where magic is the multiplicative inverse of 5 modulo 2^128 (for a 128-bit number) or 2^64 (for a 64-bit number), which can be calculated using gmp_invert. The multiplicative inverse for division works because n - rem is divisible by 5. Second, you can also use this for the quotient and/or remainder by 2*a (for example 10), with a slight adjustment: rem5 = n % 5; quot5 = (n - rem5) * magic; rem10 = (quot5 % 2) * 5 + rem5; quot10 = quot5 / 2; This would cover the important use case of getting the quotient and remainder for division by 10. However, a benchmark (source attached) indicates that this method is much faster even when only one of quotient and remainder of division by 10 is needed. Numbers I got indicate that this method is faster by about a factor of five than calling the library version. I hope this clears up the somewhat confusing string of comments in the PR. Best regards Thomas
#include <stdio.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <sys/time.h> #include <stdlib.h> typedef __uint128_t mytype; double this_time () { struct timeval tv; gettimeofday (&tv, NULL); return tv.tv_sec + tv.tv_usec * 1e-6; } #define ONE ((__uint128_t) 1) #define TWO_64 (ONE << 64) unsigned rem10_v1 (mytype n) { return n % 10; } unsigned rem10_v2 (mytype n) { mytype quot5, tmp; const mytype magic = (0xCCCCCCCCCCCCCCCC * TWO_64 + 0xCCCCCCCCCCCCCCCD * ONE); unsigned rem5 = n % 5; quot5 = (n - rem5) * magic; return rem5 + (quot5 % 2) * 5; } #define N 10000000 int main() { mytype *a; unsigned long s; int fd; double t1, t2; a = malloc (sizeof (*a) * N); fd = open ("/dev/urandom", O_RDONLY); read (fd, a, sizeof (*a) * N); close (fd); s = 0; t1 = this_time(); for (int i=0; i<N; i++) s += rem10_v1 (a[i]); t2 = this_time(); printf ("s = %lu rem10_v1: %f s\n", s, (t2-t1)); s = 0; t1 = this_time(); for (int i=0; i<N; i++) s += rem10_v2 (a[i]); t2 = this_time(); printf ("s = %lu rem10_v1: %f s\n", s, (t2-t1)); }