------- Comment #11 from whaley at cs dot utsa dot edu 2006-06-01 16:26 -------
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3
Uros,

OK, I originally replied a couple of hours ago, but that is not appearing on bugzilla for some reason, so I'll try again, this time CCing myself so I don't have to retype everything :)

>gcc version 3.4.6
>vs.
>gcc version 4.2.0 20060601 (experimental)
>
>-fomit-frame-pointer -O -msse2 -mfpmath=sse
>
>There is a small performance drop on gcc-4.x, but nothing critical.
>
>I can confirm, that code indeed runs >50% slower on 64bit athlon. Perhaps the
>problem is in the order of instructions (Software Optimization Guide for AMD
>Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the example, how
>things should be, and gcc-4.2 code looks similar to the example, how things
>should _NOT_ be.

First, thanks for looking into this! As to your point, yes, I am aware that gcc4-sse can get almost the same performance as gcc3-x87 (though not quite), and in fact can do so on the Athlon 64 as well, **but only for double precision**. To get SSE within a few percent of x87 on the AMD machine, you use a different kernel (remember, I'm sending you an example out of many), and throw the following flags:

    -march=athlon64 -O2 -mfpmath=sse -msse -msse2 -m64 \
    -ftree-vectorize -fargument-noalias-global

(note this does not vectorize the code, but I throw the flag in the hope that future versions will :)

Note that my bug report concentrates on "x87 performance"! There are reasons to use x87 even if scalar SSE is competitive performance-wise, as the x87 unit produces much superior accuracy. However, even if we were to take the tack (and gcc may be doing this for all I know) that once scalar SSE can compete performance wise, the x87 unit will no longer be supported, we must also examine single precision performance.

For single precision performance, I have never gotten any scalar SSE kernel to compete even close to the gcc3-x87 numbers.
I believe (w/o having proved it) that this is probably due to the cost of using the scalar load: double precision can use the low-overhead MOVLPD instruction, but single must use MOVSS, which is **much** slower than FLD, and so any kernel using scalar SSE blows chunks. ATLAS's best case gcc4-sse kernel gets roughly half of the gcc-x87 performance on an Athlon-64, and something like 80% on a P4e (note that Intel machines have half the theoretical peak for x87 [AMD: 2 flops/cycle, Intel: 1 flop/cycle]: getting a large % of performance gets easier the lower your peak gets!).

I originally submitted a double precision kernel, because that showed the x87 performance problem, and allowed me to reuse the infrastructure I created for an earlier bug report (bugzilla 4991). I have just uploaded an example attachment that can time both single and double precision performance, if you want to confirm for yourself that SSE is not competitive for single precision.

Thanks,
Clint

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
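
For a quick check of the single vs. double precision claim without the attachment, here is a minimal sketch of a scalar kernel and timer. It is NOT the uploaded attachment and not an ATLAS kernel: the names dot_float, dot_double and mflops_float are made up for illustration, and a plain dot product is only an assumed stand-in for the real kernels. The loop does one scalar load per operand per iteration, which is where the suspected MOVSS vs. FLD cost would show up.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in kernel: single precision dot product.
 * Each iteration loads one float from x and one from y. */
float dot_float(const float *x, const float *y, size_t n)
{
    size_t i;
    float sum = 0.0f;
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Same kernel in double precision, for comparison. */
double dot_double(const double *x, const double *y, size_t n)
{
    size_t i;
    double sum = 0.0;
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Time `reps` calls of dot_float and return MFLOPS, counting
 * 2 flops (one multiply, one add) per element.
 * The accumulated result is written to *sink so the calls are not
 * trivially dead; an aggressive optimizer may still hoist work, so
 * inspect the generated assembly before trusting the number.
 * Returns 0.0 if the run was too short for clock() to measure. */
double mflops_float(const float *x, const float *y, size_t n,
                    int reps, float *sink)
{
    int r;
    float acc = 0.0f;
    double secs;
    clock_t t0, t1;

    t0 = clock();
    for (r = 0; r < reps; r++)
        acc += dot_float(x, y, n);
    t1 = clock();

    *sink = acc;
    secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0;
    return 2.0 * (double)n * (double)reps / secs / 1e6;
}
```

To compare the two units, call mflops_float from a small main and build the same file twice, once with something like -O -fomit-frame-pointer -mfpmath=387 and once with -O -fomit-frame-pointer -msse2 -mfpmath=sse. Keep the vectors small enough to stay in cache so the scalar loads, not memory bandwidth, dominate. A double precision timer would follow the same pattern with dot_double.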