One significant (philosophical) difference between the floating point code generated by GCC and that generated by commercial compilers for IA-32 is the decision whether or not to hoist floating point constants on the x87. Phrased equivalently, the question is whether to allocate an x87 stack register to hold compile-time FP constants across basic block boundaries.
Consider the following code:

  double a[10];
  double b[10];

  void foo(int n)
  {
    int i;

    for (i=0; i<n; i++)
      {
        a[i] = 3.0*a[i] + 4.0*b[i];
        b[i] = 4.0*a[i] + 3.0*b[i];
      }
  }

The choice is whether to place the FP constants 3.0 and 4.0 into their own registers and load them before the loop, or load them from the constant pool (materialize them) during each iteration of the loop. On most targets, this decision of whether to hold constants in registers is a finely balanced trade-off. On x87, the balance is additionally affected both by the small number of FP registers and by its register stack organization, which forces us to add compensating code in and around the loop to shuffle the operands to the top of stack before use, and to pop them from the stack after the loop finishes.

The current choice made by GCC is to PRE these values into registers, whereas both the Intel and Microsoft compilers choose to load constant operands at the point they are needed. The patch below reverses this decision to allow us to benchmark/investigate the effects of reducing x87 register pressure.

Consider the effect on loop N8 of whetstone (when compiled with -O2 -ffast-math), before:

.L100:
        fxch    %st(6)
.L78:
        fld     %st(3)
        fxch    %st(1)
        fxch    %st(7)
        fyl2x
        fmul    %st(2), %st
        fmul    %st(1), %st
        fld     %st(0)
        frndint
        fsubr   %st, %st(1)
        fxch    %st(1)
        f2xm1
        fadd    %st(7), %st
        fscale
        fstp    %st(1)
        incl    %eax
        cmpl    -160(%ebp), %eax
        jne     .L100
        fstp    %st(6)
        fstp    %st(0)
        fstp    %st(0)
        fstp    %st(0)
        fxch    %st(1)
        fxch    %st(2)
        ...

vs. after:

.L78:
        fldt    .LC26
        fxch    %st(1)
        fyl2x
        fmull   .LC27
        fldt    .LC28
        fmulp   %st, %st(1)
        fld     %st(0)
        frndint
        fsubr   %st, %st(1)
        fxch    %st(1)
        f2xm1
        fld1
        faddp   %st, %st(1)
        fscale
        fstp    %st(1)
        incl    %eax
        cmpl    -148(%ebp), %eax
        jne     .L78

You'll notice that the second sequence contains an "fld1" used to load the constant 1.0 as part of the "exp" inline intrinsic, whilst in the first, this and other constants have been hoisted into FP registers and cause a large amount of shuffling on the stack.
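For anyone who wants to experiment, here is a rough source-level sketch of the two shapes of the foo loop above. It is only an illustration: the names foo_hoisted, foo_materialized and check_shapes_agree are made up for this example and are not part of the patch, and an optimizing compiler is free to turn either shape into the other, so the source form does not by itself control what the x87 code looks like. The inputs are small integers so that every intermediate value is exactly representable and the two shapes must agree exactly.

```c
#include <assert.h>

#define N 10

/* Shape 1: the constants live in locals for the whole loop, which is
   the source-level analogue of hoisting them into registers.  */
static void foo_hoisted(double *a, double *b, int n)
{
    const double c3 = 3.0, c4 = 4.0;
    int i;

    for (i = 0; i < n; i++) {
        a[i] = c3 * a[i] + c4 * b[i];
        b[i] = c4 * a[i] + c3 * b[i];
    }
}

/* Shape 2: the constants are named at each use, the analogue of
   materializing them from the constant pool where they are needed.  */
static void foo_materialized(double *a, double *b, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        a[i] = 3.0 * a[i] + 4.0 * b[i];
        b[i] = 4.0 * a[i] + 3.0 * b[i];
    }
}

/* Run both shapes on identical small-integer inputs and assert that
   the results match element by element.  Returns 1 on success.  */
static int check_shapes_agree(void)
{
    double a1[N], b1[N], a2[N], b2[N];
    int i;

    for (i = 0; i < N; i++) {
        a1[i] = a2[i] = i + 1;
        b1[i] = b2[i] = 2 * i + 1;
    }

    foo_hoisted(a1, b1, N);
    foo_materialized(a2, b2, N);

    for (i = 0; i < N; i++) {
        assert(a1[i] == a2[i]);
        assert(b1[i] == b2[i]);
    }
    return 1;
}
```

Since the two functions are semantically identical, any difference in the generated loop bodies comes down to the compiler's hoisting decision, which is the trade-off being discussed.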
If nothing else, this change may be useful for -Os. Of course, this decision (to hoist or not to hoist) requires a significant amount of benchmarking to decide whether it is more generally a win on real code such as POV-Ray, SPECfp2000, etc. It may also depend upon the IA-32 processor generation and manufacturer, as x87 stack manipulation is much cheaper on some Pentium families than on other chips.

I'm posting this patch here in the hope that it triggers some feedback and discussion.

[p.s. I was hoping that the work on killing loop.c would have progressed to the point that this change would be a trivial tweak to want_to_gcse_p, but alas this modification is encumbered by a few minor changes to the soon-to-be obsolete loop.c]

Thoughts?


2005-12-27  Roger Sayle  <[EMAIL PROTECTED]>

        * gcse.c (want_to_gcse_p): On STACK_REGS targets, look through
        constant pool references to identify stack mode constants.
        * loop.c (constant_pool_constant_p): New predicate to check
        whether operand is a floating point constant in the pool.
        (scan_loop): Avoid hoisting constants from the constant pool
        on STACK_REGS targets.
        (load_mems): Likewise.


Index: gcse.c
===================================================================
*** gcse.c      (revision 108834)
--- gcse.c      (working copy)
*************** static basic_block current_bb;
*** 1184,1189 ****
--- 1184,1197 ----
  static int
  want_to_gcse_p (rtx x)
  {
+ #ifdef STACK_REGS
+   /* On register stack architectures, don't GCSE constants from the
+      constant pool, as the benefits are often swamped by the overhead
+      of shuffling the register stack between basic blocks.  */
+   if (IS_STACK_MODE (GET_MODE (x)))
+     x = avoid_constant_pool_reference (x);
+ #endif
+
    switch (GET_CODE (x))
      {
      case REG:
Index: loop.c
===================================================================
*** loop.c      (revision 108834)
--- loop.c      (working copy)
*************** find_regs_nested (rtx deps, rtx x)
*** 977,982 ****
--- 977,991 ----
    return deps;
  }

+ /* Check whether this is a constant pool constant.  */
+ bool
+ constant_pool_constant_p (rtx x)
+ {
+   x = avoid_constant_pool_reference (x);
+   return GET_CODE (x) == CONST_DOUBLE;
+ }
+
+
  /* Optimize one loop described by LOOP.  */

  /* ??? Could also move memory writes out of loops if the destination address
*************** scan_loop (struct loop *loop, int flags)
*** 1228,1233 ****
--- 1237,1248 ----
            if (GET_MODE_CLASS (GET_MODE (SET_DEST (set))) == MODE_CC
                && CONSTANT_P (src))
              ;
+ #ifdef STACK_REGS
+           /* Don't hoist constant pool constants into stack regs.  */
+           else if (IS_STACK_MODE (GET_MODE (SET_SRC (set)))
+                    && constant_pool_constant_p (SET_SRC (set)))
+             ;
+ #endif
            /* Don't try to optimize a register that was made
               by loop-optimization for an inner loop.
               We don't know its life-span, so we can't compute
*************** load_mems (const struct loop *loop)
*** 10830,10835 ****
--- 10845,10857 ----
            && SCALAR_FLOAT_MODE_P (GET_MODE (mem)))
          loop_info->mems[i].optimize = 0;

+ #ifdef STACK_REGS
+       /* Don't hoist constant pool constants into stack registers.  */
+       if (IS_STACK_MODE (GET_MODE (mem))
+           && constant_pool_constant_p (mem))
+         loop_info->mems[i].optimize = 0;
+ #endif
+
        /* If this MEM is written to, we must be sure that there
           are no reads from another MEM that aliases this one.  */
        if (loop_info->mems[i].optimize && written)


Roger
--