------- Comment #8 from whaley at cs dot utsa dot edu  2006-05-31 14:12 -------
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Uros,

>IMO the fact that gcc 3.x beats 4.x on this code could be attributed to pure 
>luck.

As far as understanding from first principles, performance on a modern x86
(which is busy doing OOE, register renaming, CISC/RISC translation, operand
fusion and fission, etc) is *always* a blind accident, IMHO :)   I've
hand-tuned code for the x87 for a *long* time (and written my own compilation
framework), and it has been my experience that only by trying different
schedules, instruction selection, etc. can you get decent performing code.  gcc
actually does an amazing job of x87 performance when it's working right, and I
always figured it had to be empirically tweaked to get that level of performance.
The fact that x87 performance always drops off at major releases (return to
first principles over discovered best-cases) seems to verify this . . .

So, I agree with you that the difference does not seem to have some big plan
behind it, but I want to stress that it is nonetheless critical: it happens to
all x87 codes on every x86 machine (I have so far tried Pentium-D, Athlon 64
X2, and P4e), and it happens no matter what optimized code I feed gcc 4.  Note
that ATLAS is not a static library, but rather uses a code generator to tune
matrix multiplication.  What this means is that ATLAS tries thousands of
different source implementations in trying to find one that will run the
fastest on the given architecture/compiler (the code generator does things like
tiling, register blocking, unroll & jam, software pipelining, and unrolling, all
at the ANSI C source level, in an attempt to find the combination that the
compiler/arch likes).  On no x86 architecture I've installed on can gcc 4
compete with gcc 3.  Thus, out of literally thousands of implementations on each
platform, gcc 4 cannot produce one that competes with gcc 3's best case.  I
cannot, of course, send you thousands of codes and say "see all of these are
inferior", but they are, and the case I sent is not the worst.  For
instance, for single precision gemm on the Athlon 64, the kernel tuned for gcc
4 (best case of thousands taken) runs at 56.7% of the performance of the gcc
3-tuned kernel.  Nor does using SSE fix things: gcc 4 is still far slower using
SSE than gcc 3 using the x87 on all platforms, and for single precision, the
gap is worse than between x87 implementations!
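
To give a rough idea of what "different source implementations" means at the
ANSI C level, here is a minimal sketch of one matrix-multiply kernel written two
ways: a plain triple loop, and the same loop with a 2x2 register block (unroll &
jam on the two outer loops).  This is only an illustration, not ATLAS's actual
generated code; the names gemm_plain and gemm_rb2x2, the row-major layout, and
the fixed 2x2 block are assumptions made for the example.

```c
#include <stddef.h>

/* Reference variant: C += A * B, with row-major A (M x K), B (K x N),
 * and C (M x N).  One scalar accumulator per element of C. */
void gemm_plain(size_t M, size_t N, size_t K,
                const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            double acc = C[i * N + j];
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}

/* Register-blocked variant: the i and j loops are unrolled by 2 and
 * jammed, so four accumulators stay live across the k loop and each
 * loaded element of A and B is used twice.
 * Assumption for this sketch: M and N are even.  A real generator
 * would also emit cleanup code for the leftover rows and columns. */
void gemm_rb2x2(size_t M, size_t N, size_t K,
                const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < M; i += 2) {
        for (size_t j = 0; j < N; j += 2) {
            double c00 = C[i * N + j];
            double c01 = C[i * N + j + 1];
            double c10 = C[(i + 1) * N + j];
            double c11 = C[(i + 1) * N + j + 1];
            for (size_t k = 0; k < K; k++) {
                double a0 = A[i * K + k];
                double a1 = A[(i + 1) * K + k];
                double b0 = B[k * N + j];
                double b1 = B[k * N + j + 1];
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            C[i * N + j] = c00;
            C[i * N + j + 1] = c01;
            C[(i + 1) * N + j] = c10;
            C[(i + 1) * N + j + 1] = c11;
        }
    }
}
```

Both functions compute the same result; which one runs faster depends on how
the compiler schedules and allocates registers for each, which is why an
empirical search times many such variants (different block sizes, unroll
factors, and so on) rather than predicting a winner up front.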

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
