------- Comment #51 from paolo dot bonzini at lu dot unisi dot ch 2006-08-09 04:33 -------
Subject: Re: [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

> I've been scoping this a little closer on the Athlon64X2. I have found that
> the patched gcc can achieve as much as 93% of theoretical peak (5218Mflop on a
> 2800Mhz Athlon64X2!) for in-cache matmul when the code generator is allowed to
> go to town.

Not unexpected. Code was so tightly tuned for GCC 3, and so big were the changes between GCC 3 and 4, that you were comparing sort of apples to oranges. It could be interesting to see which different optimizations are performed by your code generator for GCC 3 vs. GCC 4.

>>         fmull   1440(%rcx)
>> #else
>>         fldl    1440(%rcx)
>>         fmulp   %st,%st(1)
>> #endif
>>
> To my surprise, on this arch, using the fldl/fmulp pair caused a performance
> drop. So, either my SSE experience does not necessarily translate to x87, or
> the Opteron (where I did the SSE tuning) is subtly different than the
> Athlon64X2, or my memory of the tuning is faulty. Just as a check, Paulo: is
> this the peephole you would do?
>

In some sense, this is the peephole I would rather *not* do. But the answer is yes. :-)

So, do you now agree that the bug would be fixed if the patch that is in GCC 4.2 was backported to GCC 4.1 (so that your users can use that)? And do you still see the abysmal x87 single-precision FP performance?

Thanks!

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827