[RFC/RFT] Should we hoist FP constants on x87?

Roger Sayle Tue, 27 Dec 2005 18:29:14 -0800

One significant (philosophical) difference between the floating point
code generated by GCC and that generated by commercial compilers for
IA-32 is the decision whether or not to hoist floating point constants
on the x87.  Put equivalently, the question is whether to allocate an
x87 stack register to hold compile-time FP constants across basic block
boundaries.



Consider the following code:

double a[10];
double b[10];

void foo(int n)
{
  int i;
  for (i=0; i<n; i++)
  {
    a[i] = 3.0*a[i] + 4.0*b[i];
    b[i] = 4.0*a[i] + 3.0*b[i];
  }
}

The choice is whether to place the FP constants 3.0 and 4.0 into their own
registers and load them before the loop, or to load them from the constant
pool (materialize them) during each iteration of the loop.  On most
targets, this decision of whether to hold constants in registers is a
finely balanced trade-off.  On x87, the balance is additionally affected
both by the small number of FP registers and by the register stack
organization, which forces us to add compensating code in and around the
loop to shuffle the operands to the top of stack before use and to pop
them from the stack after the loop finishes.
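
To make the two strategies concrete, here is a rough source-level
sketch of what each one amounts to.  It is only an illustration, not
what the compiler literally emits: the names pool_3, pool_4,
foo_hoisted and foo_materialized are hypothetical, the arrays are
passed as parameters to keep the sketch self-contained, and "volatile"
is used purely to force a memory load at each use so that the
difference survives optimization.

```c
#include <stddef.h>

/* Hypothetical stand-ins for constant pool entries.  "volatile" is an
   assumption of this sketch: it forces a real memory load at each use. */
static const volatile double pool_3 = 3.0;
static const volatile double pool_4 = 4.0;

/* Strategy 1: hoist.  Each constant is loaded once before the loop and
   stays live across every iteration; on x87 that means it occupies a
   stack register for the whole loop. */
void foo_hoisted(double *a, double *b, size_t n)
{
    double c3 = pool_3;
    double c4 = pool_4;
    size_t i;

    for (i = 0; i < n; i++) {
        a[i] = c3 * a[i] + c4 * b[i];
        b[i] = c4 * a[i] + c3 * b[i];
    }
}

/* Strategy 2: materialize.  Each constant is reloaded from memory at
   the point of use, so nothing extra is live across iterations. */
void foo_materialized(double *a, double *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        a[i] = pool_3 * a[i] + pool_4 * b[i];
        b[i] = pool_4 * a[i] + pool_3 * b[i];
    }
}
```

Both functions compute the same values; they differ only in how long
the constants stay live, which is exactly what the register allocator
and the stack-register pass have to pay for on x87.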


The current choice made by GCC is to PRE these values into registers,
whereas both the Intel and Microsoft compilers choose to load constant
operands at the point they are needed.  The patch below reverses this
decision to allow us to benchmark/investigate the effects of reducing
x87 register pressure.

Consider the effect on loop N8 of whetstone (when compiled with -O2
-ffast-math), before:

.L100:  fxch    %st(6)
.L78:   fld     %st(3)
        fxch    %st(1)
        fxch    %st(7)
        fyl2x
        fmul    %st(2), %st
        fmul    %st(1), %st
        fld     %st(0)
        frndint
        fsubr   %st, %st(1)
        fxch    %st(1)
        f2xm1
        fadd    %st(7), %st
        fscale
        fstp    %st(1)
        incl    %eax
        cmpl    -160(%ebp), %eax
        jne     .L100
        fstp    %st(6)
        fstp    %st(0)
        fstp    %st(0)
        fstp    %st(0)
        fxch    %st(1)
        fxch    %st(2)
        ...

vs. after

.L78:   fldt    .LC26
        fxch    %st(1)
        fyl2x
        fmull   .LC27
        fldt    .LC28
        fmulp   %st, %st(1)
        fld     %st(0)
        frndint
        fsubr   %st, %st(1)
        fxch    %st(1)
        f2xm1
        fld1
        faddp   %st, %st(1)
        fscale
        fstp    %st(1)
        incl    %eax
        cmpl    -148(%ebp), %eax
        jne     .L78


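You'll notice that the second sequence contains an "fld1" used
to load the constant 1.0 as part of the "exp" inline intrinsic.
In the first sequence, this and other constants have been hoisted
into FP registers, which causes a large amount of shuffling on the
stack: four fxch instructions on the loop path (counting the one at
.L100 on the back edge) against two in the second sequence, plus the
fstp/fxch cleanup after the loop exits.  If nothing else, this change
may be useful for -Os.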


Of course, this decision (to hoist or not to hoist) requires a
significant amount of benchmarking to decide whether it is more
generally a win on real code such as POV-Ray, SPECfp2000, etc.  It may
also depend on the IA-32 processor generation and manufacturer,
as x87 stack manipulation is much cheaper on some Pentium families
than on other processors.  I'm posting this patch here in the hope
that it triggers some feedback and/or discussion.


[p.s. I was hoping that the work on killing loop.c would have
progressed to the point that this change would be a trivial
tweak to want_to_gcse_p, but alas this modification is encumbered
by a few minor changes to the soon-to-be obsolete loop.c]

Thoughts?



2005-12-27  Roger Sayle  <[EMAIL PROTECTED]>

        * gcse.c (want_to_gcse_p): On STACK_REGS targets, look through
        constant pool references to identify stack mode constants.
        * loop.c (constant_pool_constant_p): New predicate to check
        whether operand is a floating point constant in the pool.
        (scan_loop): Avoid hoisting constants from the constant pool
        on STACK_REGS targets.
        (load_mems): Likewise.


Index: gcse.c
===================================================================
*** gcse.c      (revision 108834)
--- gcse.c      (working copy)
*************** static basic_block current_bb;
*** 1184,1189 ****
--- 1184,1197 ----
  static int
  want_to_gcse_p (rtx x)
  {
+ #ifdef STACK_REGS
+   /* On register stack architectures, don't GCSE constants from the
+      constant pool, as the benefits are often swamped by the overhead
+      of shuffling the register stack between basic blocks.  */
+   if (IS_STACK_MODE (GET_MODE (x)))
+     x = avoid_constant_pool_reference (x);
+ #endif
+
    switch (GET_CODE (x))
      {
      case REG:
Index: loop.c
===================================================================
*** loop.c      (revision 108834)
--- loop.c      (working copy)
*************** find_regs_nested (rtx deps, rtx x)
*** 977,982 ****
--- 977,991 ----
    return deps;
  }

+ /* Check whether this is a constant pool constant.  */
+ bool
+ constant_pool_constant_p (rtx x)
+ {
+   x = avoid_constant_pool_reference (x);
+   return GET_CODE (x) == CONST_DOUBLE;
+ }
+
+
  /* Optimize one loop described by LOOP.  */

  /* ??? Could also move memory writes out of loops if the destination address
*************** scan_loop (struct loop *loop, int flags)
*** 1228,1233 ****
--- 1237,1248 ----
              if (GET_MODE_CLASS (GET_MODE (SET_DEST (set))) == MODE_CC
                  && CONSTANT_P (src))
                ;
+ #ifdef STACK_REGS
+             /* Don't hoist constant pool constants into stack regs. */
+             else if (IS_STACK_MODE (GET_MODE (SET_SRC (set)))
+                      && constant_pool_constant_p (SET_SRC (set)))
+               ;
+ #endif
              /* Don't try to optimize a register that was made
                 by loop-optimization for an inner loop.
                 We don't know its life-span, so we can't compute
*************** load_mems (const struct loop *loop)
*** 10830,10835 ****
--- 10845,10857 ----
          && SCALAR_FLOAT_MODE_P (GET_MODE (mem)))
        loop_info->mems[i].optimize = 0;

+ #ifdef STACK_REGS
+       /* Don't hoist constant pool constants into stack registers.  */
+       if (IS_STACK_MODE (GET_MODE (mem))
+           && constant_pool_constant_p (mem))
+       loop_info->mems[i].optimize = 0;
+ #endif
+
        /* If this MEM is written to, we must be sure that there
         are no reads from another MEM that aliases this one.  */
        if (loop_info->mems[i].optimize && written)


Roger