Thanks to Denys Vlasenko for sending me his benchmarking code. I went and hacked on it to ransomize the numbers being converted more, since repeatedly converting the same number underestimates the number of branch mispredictions.
Then I tried computing the number of digits beforehand, as mentioned in my earlier message, and it came out slightly slower. The code is: static noinline_for_stack char *put_dec_trunc8(char *buf, unsigned n) { static unsigned const pow10[9] = { 0, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000 }; unsigned digits = (19 * fls(n) + 6) >> 6; /* Valid for < 44 bits */ unsigned t; /* * "Digits" is an estimate, always exact or high by 1, of the * number of decimal digits in r. It is offset by 1, so digits=0 * means 1 digit. Given that the largest value ever passed to * this function has 27 bits, The largest value is (19 * 27 + * 6) >> 6 = 8, meaning 9 digits. However, that will always be * rounded down because r is less than pow10[8] = 10^8. * * We could use smaller multipliers in cases 4 through 6, but that * would require a shift by less than 32, which is awkward on a * 32-bit processor and wastes more time than the larger multiply. */ switch (digits - (n < pow10[digits])) { default: __builtin_unreachable(); case 7: /* q <= 99999999 */ t = n + '0'; n = (n * (uint64_t)0x1999999a) >> 32; *buf++ = t - 10*n; case 6: /* q <= 9999999 */ t = n + '0'; n = (n * (uint64_t)0x1999999a) >> 32; *buf++ = t - 10*n; case 5: /* t <= 999999 */ t = n + '0'; n = (n * (uint64_t)0x1999999a) >> 32; *buf++ = t - 10*n; case 4: /* t <= 99999 */ t = n + '0'; n = (n * (uint64_t)0x1999999a) >> 32; *buf++ = t - 10*n; case 3: /* t <= 9999 */ t = n + '0'; n = (n * 0x199a) >> 16; *buf++ = t - 10*n; case 2: /* t <= 999 */ t = n + '0'; n = (n * 0xcd) >> 11; *buf++ = t - 10*n; case 1: /* t <= 99 */ t = n + '0'; n = (n * 0xcd) >> 11; *buf++ = t - 10*n; case 0: /* t <= 9 */ *buf++ = n + '0'; } return buf; } But! With more extensive refactoring of the number() code, computing the number of digits at the beginning can eliminate the entire business of formatting into tmp[] and copying backward. We'll first compute the number of digits, check for buffer overflow, insert the printf padding, and then call the number-formatting code, passing the number of digits in. It seems plausible that the resultant simplification will produce a speedup. I'm going to experiment with that. Denys, since I'm playing in your sandbox, do you have any violent objections to that? (Assuming, of course, that it's a net speed win and my surgery to function signatures isn't too hideous.) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/