[Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3

whaley at cs dot utsa dot edu Thu, 01 Jun 2006 11:43:54 -0700


------- Comment #12 from whaley at cs dot utsa dot edu  2006-06-01 18:43 -------
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3


Uros,


>gcc version 3.4.6
>vs.
>gcc version 4.2.0 20060601 (experimental)
>
>-fomit-frame-pointer -O -msse2 -mfpmath=sse
>There is a small performance drop on gcc-4.x, but nothing critical.
>I can confirm, that code indeed runs >50% slower on 64bit athlon. Perhaps the
>problem is in the order of instructions (Software Optimization Guide for AMD
>Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the example, how
>things should be, and gcc-4.2 code looks similar to the example, how things
>should _NOT_ be.

Thanks for looking into this!  However, I am indeed aware that by using SSE2
you
can get the double precision results fairly close to the x87 on most platforms.
In fact, you can get gcc 4.1-sse within a few % of gcc 3-x87 on the Athlon 64
as well, by changing the kernel you feed gcc, and giving it these flags:
   -march=athlon64 -O2 -mfpmath=sse -msse -msse2 -m64 \ 
   -ftree-vectorize -fargument-noalias-global
(this doesn't make it vectorize, but I throw the flag for future hope :)

Now, sometimes you want to use the x87 unit because of its superior precision,
but the real problem with the approach of "ignore the x87 performance and
just use SSE" comes in single precision.  The performance of the best
kernel found by ATLAS in single precision using gcc4.1-sse is roughly half
of that of using the x87 unit on an Athlon-64, and 80% on a P4e (one reason
they are closer on the P4e is that the P4e's x87 peak is 1/2 that of the
Athlon [AMD machines can do 2 flops/cycle using the x87, whereas intel machines
can do only 1]), so there's not as large a gap between excellent and
non-so-excellent kernels).  My guess (and it's only a guess) for the reason
scalar double-precision sse can compete and single cannot comes down to the
cost of doing scalar load and stores.  In double, you can use movlpd instead of
movsd for a low-overhead vector load, but in single you must use movss, and
since movss is much more expensive than fld, scalar SSE always blows in
comparison to x87 . . .

So, that's why my error report concentrated on "x87 performance".  I submitted
in double precision because I had a preexisting Makefile/source demonstrating
the performance problem from a prior bug report (bugzilla 4991).  I think
we should not blow off the x87 performance even if SSE *was* competitive,
because there are times when the x87 is better.  However, in single precision,
scalar SSE is not competitive, at least on the platforms I have tried.  If you
guys are planning on deprecating the x87 unit when SSE is competitive on modern
machines, I can certainly rework the tarfile so I can send you single precision
benchmark, so you can see the sse/x87 performance gap yourself.  Let me know
if you want this, as I'll need to do a bit of extra work.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827

[Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3

Reply via email to