------- Comment #13 from jb at gcc dot gnu dot org 2007-06-10 11:06 -------
(In reply to comment #11)
Thanks for the work.

> First, please note that "divss" instruction is quite _fast_, clocking at 23
> cycles, where approximation with NR step would sum up to 20 cycles, not
> counting load of constant.
>
> I have checked the performance of following testcase with various
> implementations on x86_64 C2D:
>
> --cut here--
> float test(float a)
> {
>   return 1.0 / a;
> }
>
> divss     : 3.132s
> rcpss NR  : 3.264s
> rcpss only: 3.080s

Interesting, on ubuntu/i686/K8 I get (average of 3 runs)

divss:    7.485 s
rcpss NR: 9.915 s

> To enhance the precision of 1/sqrt(A), additional NR step is calculated as
>
> x1 = 0.5 x0 (3.0 - A x0 x0)
>
> and considering that sqrtss also clocks at 23 clocks (_far_ from hundreds of
> clocks ;) ), additional NR step just isn't worth it.

Well, I suppose it depends on the hardware. IIRC older CPUs did division in
microcode, whereas at least core2 and K8 do it in hardware, so I guess the
hundreds of cycles don't apply to current CPUs. Also, supposedly Penryn will
have a much improved divider.

That being said, I think there is still a case for the reciprocal square root,
as evidenced by the benchmarks in #5 and #7 as well as my analysis of gas_dyn
linked to in the first message in this PR (in short, ifort does sqrt(a/b)
about twice as fast as gfortran by using reciprocal approximations + NR). If
div(p|s)s is indeed about as fast as rcp(p|s)s, as your benchmarks show, then
that suggests almost all the performance benefit ifort gets is due to
rsqrt(p|s)s, no? Or perhaps there is some issue with pipelining? In gas_dyn
the sqrt(a/b) loop fills an array, whereas your benchmark accumulates.

> Based on these findings, I guess that NR step is just not worth it. If we want
> to have noticeable speed-up on division and square root, we have to use 12bit
> implementations, without any refinements - mainly for benchmarketing, I'm
> afraid.

I hear that it's possible to pass spec2k6/gromacs without the NR step.
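For reference, the NR refinement being discussed can be sketched in portable C
as below. This is only an illustration, not what gcc or ifort actually emit:
the helper names are made up, and truncate_to_12_bits() is a hypothetical
stand-in for the roughly 12-bit estimate that rcpss/rsqrtss return (it simply
chops the mantissa of an exact result, so its error pattern differs from the
real instructions).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the hardware estimate: keep only the top
   12 of the 23 mantissa bits of an (otherwise exact) float. */
static inline float truncate_to_12_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFF800u;            /* clear the low 11 mantissa bits */
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* One Newton-Raphson step for 1/a, given an estimate x0:
   x1 = x0 * (2 - a * x0) */
static inline float nr_recip_step(float a, float x0)
{
    return x0 * (2.0f - a * x0);
}

/* One Newton-Raphson step for 1/sqrt(a), given an estimate x0:
   x1 = 0.5 * x0 * (3 - a * x0 * x0) */
static inline float nr_rsqrt_step(float a, float x0)
{
    return 0.5f * x0 * (3.0f - a * x0 * x0);
}
```

Each step roughly squares the relative error of the estimate, so a single step
takes a 12-bit estimate to about single precision. The price is the extra
multiplies and the subtract per result, which is the cost being weighed
against divss/sqrtss above.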
Like most MD programs, gromacs spends almost all its time in the force
calculations, where the majority of time is spent calculating 1/sqrt(...). So
perhaps one should watch out for compilers that get suspiciously high scores
on that benchmark. :) No, I'm not suggesting gcc should do this.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723