I'm using GCC to compile some code which uses SSE intrinsics. The code is being compiled at -O3 -mfpmath=sse.
GCC decides to use MMX instructions for some of the operations (zeroing some memory). There are no MMX intrinsics in the source, but an SSE _mm_setzero_ps() gets compiled into a pxor %mm0, %mm0 and a pair of movq stores from %mm0. No emms instruction is emitted, possibly because gcc "knows" the x87 FPU is not in use because of the -mfpmath=sse switch, so the FPU is left in MMX mode. However, the platform ABI (32-bit x86) specifies that floating-point return values are returned in the x87 register %st(0), so gcc moves return values from SSE registers onto the x87 stack when returning them. Because emms was never executed, the x87 tag word still marks all eight registers as in use, so loading the return value overflows the x87 register stack and the value is corrupted (with exceptions masked, as they are by default, it is replaced by the x87 "indefinite" NaN).

This behaviour seems to be very dependent on the exact version of gcc and the exact source. Here's a testcase:

    #include <stdio.h>
    #include <emmintrin.h>

    // Since this is all in one file, we need to mark some functions
    // noinline so that the interesting parts don't get compiled away
    #define NOINLINE __attribute__((noinline))

    struct OutputData { __m128 a, b; };
    struct InputData { float a, b; };

    // Something that uses an OutputData that won't be compiled away
    __m128 ga, gb;
    NOINLINE void doSomethingWith(const OutputData& d) {
        ga = d.a;
        gb = d.b;
    }

    NOINLINE void calc(const InputData& in) {
        OutputData out;
        // the next two lines are where the bug manifests
        // gcc decides to use MMX instructions to write
        // some zeros, but fails to clean up afterwards.
        out.a = _mm_setr_ps(in.a, in.b, 0, 0);
        out.b = _mm_setzero_ps();
        // ensure the above is not optimised away
        doSomethingWith(out);
    }

    NOINLINE float retFloat() { return 3; }

    int main() {
        InputData x = {3.4, 42};
        // GCC emits MMX instructions for this function, but emits no emms
        calc(x);
        // This uses the FPU to return the value, which gets corrupted
        printf("%f\n", retFloat());
        return 0;
    }

On my machine, this generates (for the function "calc"):

            .globl _Z4calcRK9InputData
            .type _Z4calcRK9InputData, @function
    _Z4calcRK9InputData:
    .LFB530:
            .cfi_startproc
            .cfi_personality 0x0,__gxx_personality_v0
            pushl %ebp
            .cfi_def_cfa_offset 8
            pxor %mm0, %mm0
            movl %esp, %ebp
            .cfi_offset 5, -8
            .cfi_def_cfa_register 5
            subl $60, %esp
            movl 8(%ebp), %eax
            movss 4(%eax), %xmm0
            movss (%eax), %xmm1
            movq %mm0, -48(%ebp)      # MMX instruction
            movq %mm0, -24(%ebp)      # MMX instruction
            leal -40(%ebp), %eax
            movq %mm0, -16(%ebp)      # MMX instruction
            movl %eax, (%esp)
            unpcklps %xmm0, %xmm1
            movaps %xmm1, %xmm0
            xorps %xmm1, %xmm1
            movlhps %xmm1, %xmm0
            movlps %xmm0, -40(%ebp)
            movhps %xmm0, -32(%ebp)
            call _Z15doSomethingWithRK10OutputData
            leave
            ret                       # FPU stack still in MMX mode and unusable for floating point
            .cfi_endproc
    .LFE530:
            .size _Z4calcRK9InputData, .-_Z4calcRK9InputData

System: Ubuntu Lucid (gcc 4:4.4.3-1ubuntu1)

GCC:
    Using built-in specs.
    Target: i486-linux-gnu
    Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.4.3-4ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --program-suffix=-4.4 --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-plugin --enable-objc-gc --enable-targets=all --disable-werror --with-arch-32=i486 --with-tune=generic --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu --target=i486-linux-gnu
    Thread model: posix
    gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)

--
           Summary: GCC generates MMX instructions but fails to generate "emms"
           Product: gcc
           Version: 4.4.3
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: stephen dot dolan at havok dot com
 GCC host triplet: i486-linux-gnu
GCC target triplet: i486-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44578
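A possible user-side workaround, sketched here as an assumption rather than anything taken from the report or a compiler fix: execute emms yourself, via the real _mm_empty() intrinsic, after calling code that may contain compiler-generated MMX instructions, so the x87 tag word is cleared before any float is returned on the x87 stack. calc, doSomethingWith and retFloat are copied from the testcase; calcThenEmms, FxsaveArea, x87TagByte and dirtyX87WithMmx are hypothetical helpers. The sketch assumes GCC or Clang on x86 (it uses GNU inline assembly). x87TagByte reads the abridged tag byte at offset 4 of an FXSAVE image (bit i set means x87 register i is in use), and dirtyX87WithMmx runs a single MMX instruction without emms so the "FPU still in MMX mode" state can be observed even with compilers that no longer emit MMX code for calc.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics; also brings in _mm_empty() from <mmintrin.h>

#define NOINLINE __attribute__((noinline))

struct OutputData { __m128 a, b; };
struct InputData { float a, b; };

__m128 ga, gb;

NOINLINE void doSomethingWith(const OutputData& d) {
    ga = d.a;
    gb = d.b;
}

// Same as the testcase: gcc 4.4.3 may emit MMX stores here without emms.
NOINLINE void calc(const InputData& in) {
    OutputData out;
    out.a = _mm_setr_ps(in.a, in.b, 0, 0);
    out.b = _mm_setzero_ps();
    doSomethingWith(out);
}

NOINLINE float retFloat() { return 3; }

// Hypothetical: 512-byte, 16-byte-aligned save area required by FXSAVE.
struct alignas(16) FxsaveArea { unsigned char bytes[512]; };

// Hypothetical: returns the abridged x87 tag byte (offset 4 of the FXSAVE
// image); bit i set means physical x87 register i is marked as in use.
unsigned x87TagByte() {
    FxsaveArea area;
    asm volatile("fxsave %0" : "=m"(area));
    return area.bytes[4];
}

// Hypothetical workaround wrapper: once calc() has returned, execute EMMS so
// every x87 register is tagged empty again. Harmless if no MMX code ran.
NOINLINE void calcThenEmms(const InputData& in) {
    calc(in);
    _mm_empty();
    // Debug-only check (compiled out with NDEBUG): the x87 stack is clean.
    assert(x87TagByte() == 0x00);
}

// Hypothetical: executes one MMX instruction and deliberately omits EMMS,
// which tags all eight x87 registers as in use (the state calc() leaves).
void dirtyX87WithMmx() {
    asm volatile("pxor %%mm0, %%mm0" ::: "mm0");
}
```

In the testcase, main would call calcThenEmms(x) instead of calc(x). EMMS is not free, so in real code it would more sensibly go once after a batch of such calls than inside a hot loop. Note that only the 32-bit ABI returns floats on the x87 stack; on x86-64 a float is returned in %xmm0, so this particular corruption does not show up there even when the tag word is dirty.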