Hi guys.  My name is Clint Whaley, and I'm the developer of ATLAS, an
open-source linear algebra package:
   http://directory.fsf.org/atlas.html

My users are asking me to support gcc 4, but right now its x87 floating-point
performance is much worse than gcc 3's.  Depending on the machine and the code
being run, it appears to be 10-50% worse.  Here is a tarfile that lets you
reproduce the problem on any machine:
   http://www.cs.utsa.edu/~whaley/mmbench4.tar.gz

I have timed it on a Pentium-D (gcc 4 gets 85% of gcc 3's performance on the
example code) and an Athlon-64 X2 (gcc 4 gets 60% of gcc 3's performance).  The
example is a typical kernel from ATLAS, not the worst case.

Looking at the assembly (the provided makefile will generate it with "make
assall"), the differences seem fairly minor.  From what I can tell, it mostly
comes down to gcc 4 using an fmull with a memory operand rather than loading
the operands onto the fp stack first.
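
To make the difference concrete, here is a minimal sketch of the kind of loop
involved.  The function name dot_kernel is a hypothetical stand-in, not code
from mmbench4, and the two x87 sequences in the comment are hand-written
illustrations of the two code shapes described above, not actual output from
either compiler.

```c
/* Hypothetical stand-in for an ATLAS-style inner loop: a dot product
 * of two length-k vectors.  Not taken from the mmbench4 tarfile. */
double dot_kernel(const double *a, const double *b, int k)
{
    double sum = 0.0;
    for (int i = 0; i < k; i++)
        sum += a[i] * b[i];   /* one multiply and one add per iteration */
    return sum;
}

/*
 * Two x87 shapes for the loop body (AT&T syntax, illustrative only,
 * assuming the running sum is kept in %st(0) between iterations):
 *
 *   operands loaded first:          multiply from memory:
 *     fldl   (%eax)                   fldl   (%eax)
 *     fldl   (%edx)                   fmull  (%edx)
 *     fmulp  %st, %st(1)              faddp  %st, %st(1)
 *     faddp  %st, %st(1)
 *
 * The left column corresponds to what the message attributes to gcc 3,
 * the right column to what it attributes to gcc 4.
 */
```

Compiling a file like this with -S under each compiler and comparing the loop
bodies is a quick way to check which shape a given gcc version picks.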

I know that SSE is the preferred target these days, but the x87 (when optimized
right) kills the single-precision SSE unit in scalar mode due to the expense of
the scalar vector load, and the x87 unit is slightly faster even in double
precision (in scalar mode).  gcc cannot yet auto-vectorize any ATLAS kernels.

Any help much appreciated,
Clint


-- 
           Summary: gcc 4 produces worse x87 code on all platforms than gcc 3
           Product: gcc
           Version: 4.1.1
            Status: UNCONFIRMED
          Severity: blocker
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: hiclint at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
