------- Comment #8 from whaley at cs dot utsa dot edu 2006-05-31 14:12 -------
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3
Uros,

>IMO the fact that gcc 3.x beats 4.x on this code could be attributed to pure
>luck.

As far as understanding from first principles, performance on a modern x86 (which is busy doing OOE, register renaming, CISC/RISC translation, operand fusion and fission, etc.) is *always* a blind accident, IMHO :)

I've hand-tuned code for the x87 for a *long* time (and written my own compilation framework), and it has been my experience that only by trying different schedules, instruction selections, etc. can you get decent-performing code. gcc actually does an amazing job of x87 performance when it's working right, and I always figured it had to be empirically tweaked to get that level of performance. The fact that x87 performance always drops off at major releases (return to first principles over discovered best cases) seems to verify this . . .

So, I agree with you that the difference does not seem to have some big plan behind it, but I want to stress that it is nonetheless critical: it happens to all x87 codes on every x86 machine (I have so far tried Pentium-D, Athlon 64 X2, and P4e), and it happens no matter what optimized code I feed gcc 4.

Note that ATLAS is not a static library, but rather uses a code generator to tune matrix multiplication. What this means is that ATLAS tries thousands of different source implementations in trying to find one that will run the fastest on the given architecture/compiler (the code generator does things like tiling, register blocking, unroll & jam, software pipelining, and unrolling, all at the ANSI C source level, in an attempt to find the combo that the compiler/arch likes). On no x86 architecture I've installed on can gcc 4 compete with gcc 3. Thus, out of literally thousands of implementations on each platform, there is not one with which gcc 4 can compete with gcc 3's best case. I cannot, of course, send you thousands of codes and say "see, all of these are inferior", but they are, and the case I sent is not the worst.
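To make the empirical-search idea concrete, here is a minimal sketch in C. It is not ATLAS code: the kernel is a dot product rather than gemm, and the names (dot_u1, dot_u2, dot_u4, pick_fastest) are hypothetical, chosen only for illustration. It times a few hand-written unrolling variants and keeps whichever runs fastest on the current machine and compiler, which is roughly the kind of selection described above, only over three candidates instead of thousands.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical kernel signature: dot product of two length-n vectors. */
typedef double (*dot_fn)(const double *, const double *, int);

/* Candidate 1: plain loop, one accumulator. */
static double dot_u1(const double *x, const double *y, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Candidate 2: unrolled by 2, two independent accumulators. */
static double dot_u2(const double *x, const double *y, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i] * y[i];
        s1 += x[i + 1] * y[i + 1];
    }
    for (; i < n; i++)              /* cleanup for odd n */
        s0 += x[i] * y[i];
    return s0 + s1;
}

/* Candidate 3: unrolled by 4, four independent accumulators. */
static double dot_u4(const double *x, const double *y, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i] * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)              /* cleanup for n not divisible by 4 */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}

/* Run one candidate `reps` times and return the elapsed CPU time in seconds. */
static double time_kernel(dot_fn f, const double *x, const double *y,
                          int n, int reps)
{
    volatile double sink = 0.0;     /* keeps the calls from being optimized away */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        sink += f(x, y, n);
    clock_t t1 = clock();
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Time every candidate and return the index of the fastest one.
 * Ties (including timings too short to measure) go to the earliest candidate. */
static int pick_fastest(const dot_fn *cands, int ncand,
                        const double *x, const double *y, int n, int reps)
{
    assert(ncand > 0);
    int best = 0;
    double best_t = time_kernel(cands[0], x, y, n, reps);
    for (int c = 1; c < ncand; c++) {
        double t = time_kernel(cands[c], x, y, n, reps);
        if (t < best_t) {
            best_t = t;
            best = c;
        }
    }
    return best;
}
```

A caller would fill two arrays, pass an array such as { dot_u1, dot_u2, dot_u4 } to pick_fastest, and use the returned index from then on. Which index wins depends on the compiler and the machine, so nothing here predicts the outcome; the point is only that the choice is made by measurement rather than from first principles.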
For instance, for single precision gemm on the Athlon 64, the kernel tuned for gcc 4 (best case of thousands taken) runs at 56.7% of the performance of the gcc 3-tuned kernel. Nor does using SSE fix things: gcc 4 is still far slower using SSE than gcc 3 using the x87 on all platforms, and for single precision, the gap is worse than between x87 implementations!

Thanks,
Clint

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827