Hi guys. My name is Clint Whaley, and I'm the developer of ATLAS, an open-source linear algebra package: http://directory.fsf.org/atlas.html
My users are asking me to support gcc 4, but right now its x87 fp performance is much worse than gcc 3's. Depending on the machine and the code being run, it appears to be between 10% and 50% worse.

Here is a tarfile that allows you to reproduce the problem on any machine: http://www.cs.utsa.edu/~whaley/mmbench4.tar.gz

I have timed it on a Pentium-D (gcc 4 gets 85% of gcc 3's performance on the example code) and an Athlon-64 X2 (gcc 4 gets 60% of gcc 3's performance). This is a typical kernel from ATLAS, not the worst . . .

Looking at the assembly (the provided makefile will generate it with "make assall"), the differences seem fairly minor. From what I can tell, it mostly comes down to gcc 4 using a from-memory fmull rather than loading the operands onto the fp stack first.

I know that SSE is the preferred target these days, but the x87 unit (when optimized right) kills the single precision SSE unit in scalar mode due to the expense of the scalar vector load, and the x87 unit is slightly faster even in double precision (in scalar mode). gcc cannot yet auto-vectorize any ATLAS kernels.

Any help much appreciated,
Clint

--
Summary: gcc 4 produces worse x87 code on all platforms than gcc 3
Product: gcc
Version: 4.1.1
Status: UNCONFIRMED
Severity: blocker
Priority: P3
Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: hiclint at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827