Hello Evandro!
x87 registers. In contrast, the x86_64 ABI specifies that FP values are passed in SSE registers, so x86_64 code avoids the costly SSE reg->stack moves. Until the i386 ABI (together with the supporting math functions) is changed to something similar to x86_64, -mfpmath=sse won't show all its power.
Actually, in many cases, SSE did help x86 performance as well. That happens in FP-intensive applications which spend a lot of time in loops where the XMM register set can be used more efficiently than the x87 stack.
There is an annoying piece of code attached to PR19780 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780), a loop that shuffles registers around a lot:

    int i;
    real v1x, v1y, v1z;
    real v2x, v2y, v2z;
    real v3x, v3y, v3z;

    for (i = 0; i < 100000000; i++)
      {
        v3x = v1y * v2z - v1z * v2y;
        v3y = v1z * v2x - v1x * v2z;
        v3z = v1x * v2y - v1y * v2x;

        v1x = v2x;
        v1y = v2y;
        v1z = v2z;

        v2x = v3x;
        v2y = v3y;
        v2z = v3z;
      }

This code could be a perfect example of how the XMM register file beats the x87 register stack. However, contrary to all expectations, the x87 code is 20% faster(!!). (This is on a P4; it would be interesting to see this comparison on x86_64, or perhaps on a 32-bit AMD.)

The structure of the code produced with -mfpmath=sse is the same as that of the code produced with -mfpmath=387, so IMO there are no register allocator effects in play. I tried to look into this problem, but at first sight the code seems optimal to me...

Uros.
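
For anyone who wants to repeat the comparison on other hardware, the loop above can be wrapped roughly as follows. This is only a sketch and not the attachment from the PR: the typedef of real as double, the function names cross_loop and time_cross_loop, the unit-vector start values and the use of clock() for timing are all my assumptions.

```c
#include <time.h>

/* Assumption: the PR testcase defines "real" itself; double is used here. */
typedef double real;

/* Run the loop from the mail n times, starting from v1 and v2,
   and store the final values back into v1 and v2.  */
void cross_loop(long n, real v1[3], real v2[3])
{
    real v1x = v1[0], v1y = v1[1], v1z = v1[2];
    real v2x = v2[0], v2y = v2[1], v2z = v2[2];
    real v3x, v3y, v3z;
    long i;

    for (i = 0; i < n; i++) {
        /* v3 = v1 x v2 */
        v3x = v1y * v2z - v1z * v2y;
        v3y = v1z * v2x - v1x * v2z;
        v3z = v1x * v2y - v1y * v2x;

        /* shift: v1 <- v2, v2 <- v3 */
        v1x = v2x; v1y = v2y; v1z = v2z;
        v2x = v3x; v2y = v3y; v2z = v3z;
    }

    v1[0] = v1x; v1[1] = v1y; v1[2] = v1z;
    v2[0] = v2x; v2[1] = v2y; v2[2] = v2z;
}

/* Return the CPU time in seconds spent on n iterations.
   The start values (unit vectors) are an arbitrary choice.  */
double time_cross_loop(long n)
{
    real v1[3] = { 1, 0, 0 };
    real v2[3] = { 0, 1, 0 };
    volatile real sink;   /* keeps the result live */
    clock_t start, stop;

    start = clock();
    cross_loop(n, v1, v2);
    stop = clock();

    sink = v1[0] + v2[0];
    (void) sink;
    return (double) (stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_cross_loop(100000000) from a small main() and building the file twice, once with something like "gcc -O2 -m32 -mfpmath=387" and once with "gcc -O2 -m32 -msse2 -mfpmath=sse", should give the two numbers to compare. The -O2 level is a guess on my side; use whatever flags the PR was reported with.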