Thanks to Denys Vlasenko for sending me his benchmarking code.

I went and hacked on it to ransomize the numbers being converted more,
since repeatedly converting the same number underestimates the number
of branch mispredictions.

Then I tried computing the number of digits beforehand, as mentioned
in my earlier message, and it came out slightly slower.  The code is:

static noinline_for_stack
char *put_dec_trunc8(char *buf, unsigned n)
{
        static unsigned const pow10[9] = { 0, 10, 100, 1000, 10000, 100000,
                                         1000000, 10000000, 100000000 };
        unsigned digits = (19 * fls(n) + 6) >> 6;  /* Valid for < 44 bits */
        unsigned t;

        /*
         * "Digits" is an estimate, always exact or high by 1, of the
         * number of decimal digits in r.  It is offset by 1, so digits=0
         * means 1 digit.  Given that the largest value ever passed to
         * this function has 27 bits, The largest value is (19 * 27 +
         * 6) >> 6 = 8, meaning 9 digits.  However, that will always be
         * rounded down because r is less than pow10[8] = 10^8.
         *
         * We could use smaller multipliers in cases 4 through 6, but that
         * would require a shift by less than 32, which is awkward on a
         * 32-bit processor and wastes more time than the larger multiply.
         */
        switch (digits - (n < pow10[digits])) {
        default:
                __builtin_unreachable();
        case 7:         /* q <= 99999999 */
                t = n + '0';
                n  = (n * (uint64_t)0x1999999a) >> 32;
                *buf++ = t - 10*n;
        case 6:         /* q <= 9999999 */
                t = n + '0';
                n  = (n * (uint64_t)0x1999999a) >> 32;
                *buf++ = t - 10*n;
        case 5:         /* t <= 999999 */
                t = n + '0';
                n  = (n * (uint64_t)0x1999999a) >> 32;
                *buf++ = t - 10*n;
        case 4:         /* t <= 99999 */
                t = n + '0';
                n  = (n * (uint64_t)0x1999999a) >> 32;
                *buf++ = t - 10*n;
        case 3:         /* t <= 9999 */
                t = n + '0';
                n  = (n * 0x199a) >> 16;
                *buf++ = t - 10*n;
        case 2:         /* t <= 999 */
                t = n + '0';
                n  = (n * 0xcd) >> 11;
                *buf++ = t - 10*n;
        case 1:         /* t <= 99 */
                t = n + '0';
                n  = (n * 0xcd) >> 11;
                *buf++ = t - 10*n;
        case 0:         /* t <= 9 */
                *buf++ = n + '0';
        }
        return buf;
}


But!  With more extensive refactoring of the number() code, computing
the number of digits at the beginning can eliminate the entire business
of formatting into tmp[] and copying backward.  We'll first compute the
number of digits, check for buffer overflow, insert the printf padding,
and then call the number-formatting code, passing the number of digits in.

It seems plausible that the resultant simplification will produce a
speedup.

I'm going to experiment with that.  Denys, since I'm playing in your sandbox,
do you have any violent objections to that?  (Assuming, of course, that it's
a net speed win and my surgery to function signatures isn't too hideous.)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to