On Fri, Aug 3, 2012 at 7:21 AM, George Spelvin <li...@horizon.com> wrote:
> The same multiply-by-inverse technique can be used to
> convert division by 10000 to a 32x32->64-bit multiply.
>
> Signed-off-by: George Spelvin <li...@horizon.com>
> ---
>  lib/vsprintf.c |   60 
> +++++++++++++++++++++++++++++++-------------------------
>  1 file changed, 33 insertions(+), 27 deletions(-)
>
> This is something of an RFC, I haven't benchmarked the helper
> function.  But it sure cleans up the code!

Can you please do that before you send alterations
to the code which *was* benchmarked?


> +static
> +unsigned put_dec_helper4(char *buf, unsigned x)
> +{
> +        uint32_t q = (x * (uint64_t)0x346DC5D7) >> 43;
> +
> +        put_dec_full4(buf, x - q * 10000);
> +        return q;
>  }
>
>  /* Based on code by Douglas W. Jones found at
> @@ -277,28 +292,19 @@ char *put_dec(char *buf, unsigned long long n)
>         d3  = (h >> 16); /* implicit "& 0xffff" */
>
>         q   = 656 * d3 + 7296 * d2 + 5536 * d1 + ((uint32_t)n & 0xffff);
> +       q = put_dec_helper4(buf, q);
> +
> +       q += 7671 * d3 + 9496 * d2 + 6 * d1;
> +       q = put_dec_helper4(buf+4, q);
> +
> +       q += 4749 * d3 + 42 * d2;
> +       q = put_dec_helper4(buf+8, q);
>
> -       buf = put_dec_full4(buf, q % 10000);
> -       q   = q / 10000;
> -
> -       d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
> -       buf = put_dec_full4(buf, d1 % 10000);
> -       q   = d1 / 10000;
> -
> -       d2  = q + 4749 * d3 + 42 * d2;
> -       buf = put_dec_full4(buf, d2 % 10000);
> -       q   = d2 / 10000;
> -
> -       d3  = q + 281 * d3;
> -       if (!d3)
> -               goto done;
> -       buf = put_dec_full4(buf, d3 % 10000);
> -       q   = d3 / 10000;
> -       if (!q)
> -               goto done;
> -       buf = put_dec_full4(buf, q);
> - done:
> -       while (buf[-1] == '0')
> +       q += 281 * d3;
> +       buf += 12;
> +       if (q)
> +               buf = put_dec_trunc8(buf, q);
> +       else while (buf[-1] == '0')
>                 --buf;

gcc was already doing that optimization for us.
It will use a larger constant and shift, but that
changes neither size nor speed.

Moreover, put_dec_helper4 will get inlined,
so the generated assembly code will be very similar.

Here is the comparison of the x86-32 assembly
of the fragment which does "x / 10000" thing,
before and after the patch:

-01 c6                  add    %eax,%esi
-b8 59 17 b7 d1         mov    $0xd1b71759,%eax
-f7 e6                  mul    %esi
-89 d3                  mov    %edx,%ebx
-89 f2                  mov    %esi,%edx
-c1 eb 0d               shr    $0xd,%ebx

+01 c7                  add    %eax,%edi
+b8 d7 c5 6d 34         mov    $0x346dc5d7,%eax
+f7 e7                  mul    %edi
+89 55 e8               mov    %edx,-0x18(%ebp)
+8b 5d e8               mov    -0x18(%ebp),%ebx
+89 fa                  mov    %edi,%edx
+89 45 e4               mov    %eax,-0x1c(%ebp)
+c1 eb 0b               shr    $0xb,%ebx

Poor gcc got confused, and generated somewhat
worse code (spilling and immediately reloading upper
part of 32x32->64 multiply).

Please test and benchmark your changes to this code
before submitting them.

-- 
vda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to