[Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3

whaley at cs dot utsa dot edu Sun, 25 Jun 2006 17:55:41 -0700


------- Comment #19 from whaley at cs dot utsa dot edu  2006-06-26 00:55 -------
Thanks for the info.  I'm sorry to hear that no performance regression tests
are done, but I guess it kind of explains why these problems reoccur :)


As to not unrolling, the fully unrolled case is almost always commandingly
better whenever I've looked at it.  After your note, I just tried on my P4,
using ATLAS's  P4 kernel, and I get (ku is inner loop unrolling, and nb=40, so
40 is fully unrolled):
  GCC 4 ku=1  : 1.65Gflop
  GCC 4 ku=40 : 1.84Gflop
  Gcc 3 ku=1  : 1.90Gflop
  Gcc 3 ku=40:  2.19Gflop

This is throwing the -funroll-loops flag.

BTW, gcc 4 w/o the -funroll-loops (ku=1) is indeed slower, at roughly 1.54 . .
.

Anyway, I've never found the performance of gcc ku=1 competitive with ku=<fully
unrolled> on any machine.  Even in assembly, I have to fully unroll the inner
loop to get near peak on all intel machines.  On the Opteron, you can get
within 5% or so with a rolled loop in assembly, but I've not gotten a C code to
do that.I think the gcc unrolling probably defaults to something like 4 or 8
(guess from performance, not verified): unrolling all the way (the loop is over
a compile-time constant) is the way to go . . .

When you said competitive, did you mean that gcc 4 ku=1 was competitive with
gcc 4 ku=40 or gcc 3 ku=1?  If the latter, I find it hard to believe unless you
use SSE for gcc 4 and something unexpected happens.  Even so, if you are using
SSE try it with the single precision kernel, where SSE cannot compete with the
x87 unit (even the broken one in gcc 4). 

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827

[Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3

Reply via email to