------- Comment #12 from whaley at cs dot utsa dot edu 2006-06-01 18:43 -------
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3

Uros,

>gcc version 3.4.6
>vs.
>gcc version 4.2.0 20060601 (experimental)
>
>-fomit-frame-pointer -O -msse2 -mfpmath=sse
>There is a small performance drop on gcc-4.x, but nothing critical.
>I can confirm, that code indeed runs >50% slower on 64bit athlon. Perhaps the
>problem is in the order of instructions (Software Optimization Guide for AMD
>Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the example, how
>things should be, and gcc-4.2 code looks similar to the example, how things
>should _NOT_ be.

Thanks for looking into this! However, I am indeed aware that by using SSE2
you can get the double precision results fairly close to the x87 on most
platforms. In fact, you can get gcc 4.1-sse within a few % of gcc 3-x87 on
the Athlon 64 as well, by changing the kernel you feed gcc, and giving it
these flags:
   -march=athlon64 -O2 -mfpmath=sse -msse -msse2 -m64 \
   -ftree-vectorize -fargument-noalias-global
(this doesn't make it vectorize, but I throw the flag for future hope :)

Now, sometimes you want to use the x87 unit because of its superior
precision, but the real problem with the approach of "ignore the x87
performance and just use SSE" comes in single precision. The performance of
the best kernel found by ATLAS in single precision using gcc4.1-sse is
roughly half of that of using the x87 unit on an Athlon-64, and 80% on a P4e
(one reason they are closer on the P4e is that the P4e's x87 peak is 1/2 that
of the Athlon [AMD machines can do 2 flops/cycle using the x87, whereas Intel
machines can do only 1], so there's not as large a gap between excellent and
not-so-excellent kernels).

My guess (and it's only a guess) for the reason scalar double-precision SSE
can compete and single cannot comes down to the cost of doing scalar loads
and stores. In double, you can use movlpd instead of movsd for a low-overhead
vector load, but in single you must use movss, and since movss is much more
expensive than fld, scalar SSE always blows in comparison to x87 . . .
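
To make the single-precision case concrete, here is a minimal sketch of the
kind of scalar kernel at issue. It is not the ATLAS kernel from the tarfile;
the function name sdot and the file name sdot.c are made up purely for
illustration, and a real ATLAS kernel would be unrolled and tuned.

```c
#include <stddef.h>

/*
 * Hypothetical scalar single-precision dot product (illustration only,
 * not the ATLAS kernel discussed in this report).
 * Each iteration does one scalar load from x and one from y, so the cost
 * of the scalar load instruction the compiler picks sits directly in the
 * inner loop.
 */
float sdot(const float *x, const float *y, size_t n)
{
    float acc = 0.0f;
    size_t i;

    for (i = 0; i < n; i++)
        acc += x[i] * y[i];   /* scalar multiply-add, no vectorization assumed */

    return acc;
}
```

Assuming the file is saved as sdot.c, generating assembly both ways should
let one compare the loads in the inner loop (fld with x87 math, movss with
scalar SSE math), with something like:
   gcc -O2 -mfpmath=387 -S sdot.c -o sdot-x87.s
   gcc -O2 -msse -mfpmath=sse -S sdot.c -o sdot-sse.s
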

So, that's why my error report concentrated on "x87 performance". I
submitted in double precision because I had a preexisting Makefile/source
demonstrating the performance problem from a prior bug report (bugzilla
4991). I think we should not blow off the x87 performance even if SSE *was*
competitive, because there are times when the x87 is better. However, in
single precision, scalar SSE is not competitive, at least on the platforms I
have tried.

If you guys are planning on deprecating the x87 unit when SSE is competitive
on modern machines, I can certainly rework the tarfile so I can send you a
single precision benchmark, so you can see the SSE/x87 performance gap
yourself. Let me know if you want this, as I'll need to do a bit of extra
work.

Thanks,
Clint

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827