https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97282
--- Comment #1 from Thomas Koenig <tkoenig at gcc dot gnu.org> --- And here is a version which uses two 64-bit numbers for calculation of he sum of decimal digits as a benchmark for the division and modulo: unsigned long digsum3 (myint x) { unsigned long ret; __uint64_t high, low; const unsigned long int rem_high[10] = {0,6,2,8,4,0,6,2,8,4}; const unsigned long int foo_high[10] = {0x0000000000000000, 0x1999999999999999, 0x3333333333333333, 0x4CCCCCCCCCCCCCCC, 0x6666666666666666, 0x8000000000000000, 0x9999999999999999, 0xB333333333333333, 0xCCCCCCCCCCCCCCCC, 0xE666666666666666 }; high = x >> 64; low = x; ret = 0; while (low > 0 || high > 0) { unsigned long r_high, r_low, r_sum, r_carry; r_high = high % 10; r_carry = rem_high[r_high]; high = high / 10; r_low = low % 10; low = low / 10; low = low + foo_high[r_high]; r_sum = r_low + r_carry; if (r_sum >= 10) { r_sum = r_sum - 10; low ++; } ret = ret + r_sum; } return ret; } It is _much_ faster, taking around 250 to 260 cycles per calculation, a speedup of a factor of around 8 versus the original code.