------- Comment #11 from ubizjak at gmail dot com 2007-06-10 08:28 -------
I have experimented a bit with rcpss, trying to measure the effect of an
additional NR step on performance. The NR step was calculated based on
http://en.wikipedia.org/wiki/N-th_root_algorithm, and for N=-1 (i.e. 1/A) it
simplifies to:

x1 = x0 * (2.0 - A * x0)
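
For illustration, the same refinement written out with SSE1 intrinsics (a
sketch only - the rcp_nr helper is mine, not part of the patch below):

--cut here--
#include <xmmintrin.h>

/* Sketch only: refine the ~12-bit rcpss estimate towards 24 bits with
   one NR step, x1 = x0 * (2.0 - A * x0).  */
static float
rcp_nr (float A)
{
  __m128 a  = _mm_set_ss (A);
  __m128 x0 = _mm_rcp_ss (a);                  /* ~12-bit estimate */
  __m128 e  = _mm_sub_ss (_mm_set_ss (2.0f),
                          _mm_mul_ss (a, x0)); /* 2.0 - A * x0 */
  __m128 x1 = _mm_mul_ss (x0, e);              /* x0 * (2.0 - A * x0) */
  return _mm_cvtss_f32 (x1);
}
--cut here--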
To obtain 24-bit precision, we have to use a reciprocal, two multiplies and a
subtraction (plus a constant load).

First, please note that the "divss" instruction is quite _fast_, clocking in
at 23 cycles, whereas the approximation with an NR step sums up to 20 cycles,
not counting the load of the constant.

I have checked the performance of the following testcase with various
implementations on x86_64 C2D:

--cut here--
#include <stdio.h>

float test(float a)
{
  return 1.0 / a;
}

int main()
{
  float a = 1.12345;
  volatile float t = 0.0;
  int i;

  for (i = 1; i < 1000000000; i++)
    {
      t += test (a);
      a += 1.0;
    }

  printf("%f\n", t);
  return 0;
}
--cut here--

divss     : 3.132s
rcpss NR  : 3.264s
rcpss only: 3.080s

To enhance the precision of 1/sqrt(A), the additional NR step is calculated as

x1 = 0.5 * x0 * (3.0 - A * x0 * x0)

(see the sketch at the end of this comment), and considering that sqrtss also
clocks in at 23 cycles (_far_ from hundreds of cycles ;) ), the additional NR
step just isn't worth it.

The experimental patch:

Index: i386.md
===================================================================
--- i386.md     (revision 125599)
+++ i386.md     (working copy)
@@ -15399,6 +15399,15 @@
 ;; Gcc is slightly more smart about handling normal two address instructions
 ;; so use special patterns for add and mull.

+(define_insn "*rcpsf2_sse"
+  [(set (match_operand:SF 0 "register_operand" "=x")
+       (unspec:SF [(match_operand:SF 1 "nonimmediate_operand" "xm")]
+                  UNSPEC_RCP))]
+  "TARGET_SSE"
+  "rcpss\t{%1, %0|%0, %1}"
+  [(set_attr "type" "sse")
+   (set_attr "mode" "SF")])
+
 (define_insn "*fop_sf_comm_mixed"
   [(set (match_operand:SF 0 "register_operand" "=f,x")
        (match_operator:SF 3 "binary_fp_operator"
@@ -15448,6 +15457,29 @@
           (const_string "fop")))
    (set_attr "mode" "SF")])

+(define_insn_and_split "*rcp_sf_1_sse"
+  [(set (match_operand:SF 0 "register_operand" "=x")
+       (div:SF (match_operand:SF 1 "immediate_operand" "F")
+               (match_operand:SF 2 "nonimmediate_operand" "xm")))
+   (clobber (match_scratch:SF 3 "=&x"))
+   (clobber (match_scratch:SF 4 "=&x"))]
+  "TARGET_SSE_MATH
+   && operands[1] == CONST1_RTX (SFmode)
+   && flag_unsafe_math_optimizations"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 3)(match_dup 2))
+   (set (match_dup 4)(match_dup 5))
+   (set (match_dup 0)(unspec:SF [(match_dup 3)] UNSPEC_RCP))
+   (set (match_dup 3)(mult:SF (match_dup 3)(match_dup 0)))
+   (set (match_dup 4)(minus:SF (match_dup 4)(match_dup 3)))
+   (set (match_dup 0)(mult:SF (match_dup 0)(match_dup 4)))]
+{
+  rtx two = const_double_from_real_value (dconst2, SFmode);
+
+  operands[5] = validize_mem (force_const_mem (SFmode, two));
+})
+
 (define_insn "*fop_sf_1_mixed"
   [(set (match_operand:SF 0 "register_operand" "=f,f,x")
        (match_operator:SF 3 "binary_fp_operator"

Based on these findings, I guess that the NR step is just not worth it. If we
want a noticeable speed-up on division and square root, we have to use the
12-bit implementations without any refinement - mainly for benchmarketing,
I'm afraid.

BTW: on x86_64, patched gcc compiles the "test" function to:

test:
        movaps  %xmm0, %xmm1
        rcpss   %xmm0, %xmm0
        movss   .LC1(%rip), %xmm2
        mulss   %xmm0, %xmm1
        subss   %xmm1, %xmm2
        mulss   %xmm2, %xmm0
        ret


-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
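
For completeness, here is the 1/sqrt(A) refinement mentioned above, written
the same way (again a sketch only - the rsqrt_nr helper is mine):

--cut here--
#include <xmmintrin.h>

/* Sketch only: refine the ~12-bit rsqrtss estimate with one NR step,
   x1 = 0.5 * x0 * (3.0 - A * x0 * x0).  */
static float
rsqrt_nr (float A)
{
  __m128 a  = _mm_set_ss (A);
  __m128 x0 = _mm_rsqrt_ss (a);                 /* ~12-bit estimate */
  __m128 e  = _mm_sub_ss (_mm_set_ss (3.0f),
                          _mm_mul_ss (_mm_mul_ss (a, x0), x0));
  __m128 x1 = _mm_mul_ss (_mm_mul_ss (_mm_set_ss (0.5f), x0), e);
  return _mm_cvtss_f32 (x1);
}
--cut here--

Four multiplies and a subtraction on top of rsqrtss, plus two constant loads,
which supports the conclusion that the refinement does not pay off against a
23-cycle sqrtss.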