------- Comment #11 from ubizjak at gmail dot com  2007-06-10 08:28 -------
I have experimented a bit with rcpss, trying to measure the effect of an
additional NR step on performance. The NR step was derived from
http://en.wikipedia.org/wiki/N-th_root_algorithm, and for N=-1 (1/A) it
simplifies to:

x1 = x0 * (2.0 - A * x0)

To obtain 24-bit precision, we have to use the reciprocal estimate, two
multiplies and a subtraction (plus a constant load).

First, please note that the "divss" instruction is quite _fast_, clocking in
at 23 cycles, whereas the rcpss approximation with an NR step sums up to 20
cycles, not counting the load of the constant.

I have checked the performance of the following testcase with various
implementations on x86_64 C2D:

--cut here--
#include <stdio.h>

float test(float a)
{
  return 1.0f / a;
}

int main()
{
  float a = 1.12345f;
  volatile float t = 0.0f;
  int i;

  for (i = 1; i < 1000000000; i++)
    {
      t += test (a);
      a += 1.0f;
    }

  printf("%f\n", t);

  return 0;
}
--cut here--

divss     : 3.132s
rcpss NR  : 3.264s
rcpss only: 3.080s

To enhance the precision of 1/sqrt(A), the additional NR step is calculated as

x1 = 0.5 * x0 * (3.0 - A * x0 * x0)

and considering that sqrtss also clocks in at 23 cycles (_far_ from hundreds
of cycles ;) ), the additional NR step just isn't worth it.

The experimental patch:

Index: i386.md
===================================================================
--- i386.md     (revision 125599)
+++ i386.md     (working copy)
@@ -15399,6 +15399,15 @@
 ;; Gcc is slightly more smart about handling normal two address instructions
 ;; so use special patterns for add and mull.

+(define_insn "*rcpsf2_sse"
+  [(set (match_operand:SF 0 "register_operand" "=x")
+       (unspec:SF [(match_operand:SF 1 "nonimmediate_operand" "xm")]
+                  UNSPEC_RCP))]
+  "TARGET_SSE"
+  "rcpss\t{%1, %0|%0, %1}"
+  [(set_attr "type" "sse")
+   (set_attr "mode" "SF")])
+
 (define_insn "*fop_sf_comm_mixed"
   [(set (match_operand:SF 0 "register_operand" "=f,x")
        (match_operator:SF 3 "binary_fp_operator"
@@ -15448,6 +15457,29 @@
           (const_string "fop")))
    (set_attr "mode" "SF")])

+(define_insn_and_split "*rcp_sf_1_sse"
+  [(set (match_operand:SF 0 "register_operand" "=x")
+       (div:SF (match_operand:SF 1 "immediate_operand" "F")
+               (match_operand:SF 2 "nonimmediate_operand" "xm")))
+   (clobber (match_scratch:SF 3 "=&x"))
+   (clobber (match_scratch:SF 4 "=&x"))]
+  "TARGET_SSE_MATH
+   && operands[1] == CONST1_RTX (SFmode)
+   && flag_unsafe_math_optimizations"
+   "#"
+   "&& reload_completed"
+   [(set (match_dup 3)(match_dup 2))
+    (set (match_dup 4)(match_dup 5))
+    (set (match_dup 0)(unspec:SF [(match_dup 3)] UNSPEC_RCP))
+    (set (match_dup 3)(mult:SF (match_dup 3)(match_dup 0)))
+    (set (match_dup 4)(minus:SF (match_dup 4)(match_dup 3)))
+    (set (match_dup 0)(mult:SF (match_dup 0)(match_dup 4)))]
+{
+  rtx two = const_double_from_real_value (dconst2, SFmode);
+
+  operands[5] = validize_mem (force_const_mem (SFmode, two));
+})
+
 (define_insn "*fop_sf_1_mixed"
   [(set (match_operand:SF 0 "register_operand" "=f,f,x")
        (match_operator:SF 3 "binary_fp_operator"

Based on these findings, I guess the NR step is just not worth it. If we want
a noticeable speed-up on division and square root, we have to use the 12-bit
approximations without any refinement - mainly for benchmarketing, I'm
afraid.

BTW: on x86_64, the patched gcc compiles the "test" function to:

test:
        movaps  %xmm0, %xmm1
        rcpss   %xmm0, %xmm0
        movss   .LC1(%rip), %xmm2
        mulss   %xmm0, %xmm1
        subss   %xmm1, %xmm2
        mulss   %xmm2, %xmm0
        ret


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723