------- Additional Comments From uros at kss-loka dot si 2005-01-27 09:07 -------
I don't think that this has anything to do with regstack and sched2. The fact is that for FP-intensive applications, 8 FP registers (either stacked x87 or non-stack SSE type) are not enough. When there is a shortage of registers, gcc starts to swap registers to and from memory.
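For illustration only (this is not the benchmark from this PR), here is a minimal sketch of the kind of code that can run out of FP registers. The function name pressure_kernel and its body are made up for this example; the only point is that ten FP values are needed at the same time, which is more than the eight available registers.

```c
/* Hypothetical kernel, not taken from the PR: ten doubles are loaded
   up front and every one of them is used again in the second
   expression, so more than eight FP values are live at once. */
double pressure_kernel(const double *x)
{
    double a = x[0], b = x[1], c = x[2], d = x[3], e = x[4];
    double f = x[5], g = x[6], h = x[7], i = x[8], j = x[9];

    /* First pass: pairwise products; all inputs are still needed below. */
    double p = a * b + c * d + e * f + g * h + i * j;

    /* Second pass: reuse all ten inputs, keeping them live across p. */
    double q = (a + j) * (b + i) * (c + h) * (d + g) * (e + f);

    return p + q;
}
```

Compiling this with something like "gcc -O2 -m32 -mfpmath=387 -S" and reading the generated assembly should show whether values are stored to and reloaded from the stack. Whether gcc actually spills here, or simply reloads the inputs from x, depends on the compiler version and flags, so treat this as a sketch and not as a reproducer.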
Please note that reg/reg and reg/mem FP ops have the same latency/throughput on P4, but moving FP registers to and from memory introduces a big performance penalty, and these moves should be minimised as much as possible. There are some measurements to prove this (-O2 only to avoid fast-math intrinsic shortcuts, P4-3.2 timings):

a) -march=pentium -mfpmath=387: scheduling and reg-stack interactions:
real 0m34.073s
user 0m33.756s
sys 0m0.018s

b) -march=pentium -msse2 -mfpmath=sse: scheduling and no reg-stack:
real 0m35.063s
user 0m34.674s
sys 0m0.076s

c) -march=pentium4 -mfpmath=387: no scheduling with reg-stack:
real 0m33.720s
user 0m33.348s
sys 0m0.037s

d) -march=pentium4 -mfpmath=sse: no scheduling and no reg-stack:
real 0m35.399s
user 0m35.016s
sys 0m0.035s

The question I would like to ask: is there functionality in gcc to optimise register moves, considering the cost of reg/reg vs. reg/mem FP operators and the cost of a register<->mem move?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=8126