https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98479
Bug ID: 98479 Summary: Missed optimization opportunity for unsigned __int128 modulo Product: gcc Version: 9.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: dabler at gmail dot com Target Milestone: --- I have found that manually calculating the % operator on __int128 is significantly faster than the built-in compiler operator. I will show you how to calculate modulo 9, but the method can be used to calculate modulo any other number. First, consider the built-in compiler operator: uint64_t mod9_v1(unsigned __int128 n) { return n % 9; } Now consider my manual implementation: uint64_t mod9_v2(unsigned __int128 n) { uint64_t r = 0; r += (uint32_t)(n); r += (uint32_t)(n >> 32) * (uint64_t)4; r += (uint32_t)(n >> 64) * (uint64_t)7; r += (uint32_t)(n >> 96); return r % 9; } Measuring over 100,000,000 random numbers gives the following results: mod9_v1 | 3.986052 secs mod9_v2 | 1.814339 secs GCC 9.3.0 with -march=native -O3 was used on AMD Ryzen Threadripper 2990WX. Note that 2^32 == 4 (mod 9), 2^64 == 7 (mod 9), etc.