------- Comment #51 from paolo dot bonzini at lu dot unisi dot ch  2006-08-09 
04:33 -------
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> I've been scoping this a little closer on the Athlon64X2.  I have found that
> the patched gcc can achieve as much as 93% of theoretical peak (5218Mflop on a
> 2800Mhz Athlon64X2!) for in-cache matmul when the code generator is allowed to
> go to town.
Not unexpected.  The code was so tightly tuned for GCC 3, and the changes 
between GCC 3 and 4 were so big, that you were more or less comparing 
apples to oranges.  It would be interesting to see which optimizations 
your code generator performs differently for GCC 3 vs. GCC 4.
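
For reference, the 93% figure quoted above can be checked with a few 
lines of C.  This is only a sketch: the function name peak_fraction is 
made up for illustration, and the peak of 2 flops per cycle is an 
assumption (it is the value that makes 5218 Mflop on a 2800 MHz part 
come out at about 93%), not something stated in the report.

```c
#include <assert.h>

/* Fraction of theoretical peak reached by a measured rate.
 * flops_per_cycle is an ASSUMED peak issue rate; it is not given
 * in the quoted message. */
double peak_fraction(double achieved_mflops, double clock_mhz,
                     double flops_per_cycle)
{
    assert(clock_mhz > 0.0 && flops_per_cycle > 0.0);
    return achieved_mflops / (clock_mhz * flops_per_cycle);
}
```

With the quoted numbers, peak_fraction(5218.0, 2800.0, 2.0) is 
5218 / 5600, i.e. roughly 0.93.
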
>>        fmull   1440(%rcx)
>> #else
>>        fldl    1440(%rcx)
>>        fmulp   %st,%st(1)
>> #endif
>>     
> To my surprise, on this arch, using the fldl/fmulp pair caused a performance
> drop.  So, either my SSE experience does not necessarily translate to x87, or
> the Opteron (where I did the SSE tuning) is subtly different than the
> Athlon64X2, or my memory of the tuning is faulty.  Just as a check, Paulo: is
> this the peephole you would do?
>   
In some sense, this is the peephole I would rather *not* do.  But the 
answer is yes. :-)
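
To make explicit what the two variants compute, here is a toy model of 
the x87 register stack in C.  It is a sketch only: the names x87_stack, 
fld, fmul_mem, fmulp_st_st1, mul_folded and mul_split are hypothetical, 
and it models just the arithmetic effect of the instructions, not their 
timing or the extended-precision format.

```c
#include <assert.h>

/* Toy x87 register stack: st[depth-1] plays the role of %st(0). */
typedef struct {
    double st[8];
    int depth;              /* number of live stack entries */
} x87_stack;

/* fldl mem: push the memory operand onto the stack. */
static void fld(x87_stack *s, double mem)
{
    assert(s->depth < 8);
    s->st[s->depth++] = mem;
}

/* fmull mem: %st(0) = %st(0) * mem, stack depth unchanged. */
static void fmul_mem(x87_stack *s, double mem)
{
    assert(s->depth >= 1);
    s->st[s->depth - 1] *= mem;
}

/* fmulp %st,%st(1): %st(1) = %st(1) * %st(0), then pop. */
static void fmulp_st_st1(x87_stack *s)
{
    assert(s->depth >= 2);
    s->st[s->depth - 2] *= s->st[s->depth - 1];
    s->depth--;
}

/* Variant 1: multiply with the load folded into the instruction. */
double mul_folded(double acc, double mem)
{
    x87_stack s = { {0}, 0 };
    fld(&s, acc);           /* accumulator already on the stack */
    fmul_mem(&s, mem);
    return s.st[s.depth - 1];
}

/* Variant 2: separate load followed by a popping multiply. */
double mul_split(double acc, double mem)
{
    x87_stack s = { {0}, 0 };
    fld(&s, acc);
    fld(&s, mem);
    fmulp_st_st1(&s);
    return s.st[s.depth - 1];
}
```

Both variants leave the same product in %st(0) and the same stack 
depth; they differ only in instruction count and in transiently using 
one more stack slot, which is why the choice between them is purely a 
performance question.
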

So, do you now agree that the bug would be fixed if the patch that is in 
GCC 4.2 were backported to GCC 4.1 (so that your users can use it)?

And do you still see the abysmal x87 single-precision FP performance?

Thanks!


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
