On Fri, Aug 3, 2012 at 7:21 AM, George Spelvin <li...@horizon.com> wrote: > The same multiply-by-inverse technique can be used to > convert division by 10000 to a 32x32->64-bit multiply. > > Signed-off-by: George Spelvin <li...@horizon.com> > --- > lib/vsprintf.c | 60 > +++++++++++++++++++++++++++++++------------------------- > 1 file changed, 33 insertions(+), 27 deletions(-) > > This is something of an RFC, I haven't benchmarked the helper > function. But it sure cleans up the code!
Can you please do that before you send alterations to the code which *was* benchmarked? > +static > +unsigned put_dec_helper4(char *buf, unsigned x) > +{ > + uint32_t q = (x * (uint64_t)0x346DC5D7) >> 43; > + > + put_dec_full4(buf, x - q * 10000); > + return q; > } > > /* Based on code by Douglas W. Jones found at > @@ -277,28 +292,19 @@ char *put_dec(char *buf, unsigned long long n) > d3 = (h >> 16); /* implicit "& 0xffff" */ > > q = 656 * d3 + 7296 * d2 + 5536 * d1 + ((uint32_t)n & 0xffff); > + q = put_dec_helper4(buf, q); > + > + q += 7671 * d3 + 9496 * d2 + 6 * d1; > + q = put_dec_helper4(buf+4, q); > + > + q += 4749 * d3 + 42 * d2; > + q = put_dec_helper4(buf+8, q); > > - buf = put_dec_full4(buf, q % 10000); > - q = q / 10000; > - > - d1 = q + 7671 * d3 + 9496 * d2 + 6 * d1; > - buf = put_dec_full4(buf, d1 % 10000); > - q = d1 / 10000; > - > - d2 = q + 4749 * d3 + 42 * d2; > - buf = put_dec_full4(buf, d2 % 10000); > - q = d2 / 10000; > - > - d3 = q + 281 * d3; > - if (!d3) > - goto done; > - buf = put_dec_full4(buf, d3 % 10000); > - q = d3 / 10000; > - if (!q) > - goto done; > - buf = put_dec_full4(buf, q); > - done: > - while (buf[-1] == '0') > + q += 281 * d3; > + buf += 12; > + if (q) > + buf = put_dec_trunc8(buf, q); > + else while (buf[-1] == '0') > --buf; gcc was already doing that optimization for us. It will use a larger constant and shift, but that changes neither size nor speed. Moreover, put_dec_helper4 will get inlined, so the generated assembly code will be very similar. Here is the comparison of the x86-32 assembly of the fragment which does "x / 10000" thing, before and after the patch: -01 c6 add %eax,%esi -b8 59 17 b7 d1 mov $0xd1b71759,%eax -f7 e6 mul %esi -89 d3 mov %edx,%ebx -89 f2 mov %esi,%edx -c1 eb 0d shr $0xd,%ebx +01 c7 add %eax,%edi +b8 d7 c5 6d 34 mov $0x346dc5d7,%eax +f7 e7 mul %edi +89 55 e8 mov %edx,-0x18(%ebp) +8b 5d e8 mov -0x18(%ebp),%ebx +89 fa mov %edi,%edx +89 45 e4 mov %eax,-0x1c(%ebp) +c1 eb 0b shr $0xb,%ebx Poor gcc got confused, and generated somewhat worse code (spilling and immediately reloading upper part of 32x32->64 multiply). Please test and benchmark your changes to this code before submitting them. -- vda -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/