I've tested this under gcc 4.2.1 (DJGPP) and gcc 3.4.4 (cygwin) (see below for gcc -v outputs). The test program is simple: generate two random complex numbers, multiply them using the FPU, and do it again using 3dnow inline assembly. I can get it to work with different code, but this version illustrates an apparent bug in the optimization scheme.
Compiling as gcc -m3dnow cdtest.c produces the correct output:

x = 0.122114 + 0.096877i        y = 0.372626 + 0.199503i
z = 0.026175 + 0.060461i        z = 0.026175 + 0.060461i

Compiling as above with an additional -O(1,2, or 3) produces:

x = 0.122114 + 0.096877i        y = 0.372626 + 0.199503i
z = 0.026175 + 0.060461i        z = 0.011737 + 0.018312i

with the inline-assembly (second) value wrong.

--- cdtest.c
#include <stdio.h>
#include <stdlib.h>

#define fRand() (((float)rand())/RAND_MAX)

typedef struct fComp { float x,y; } fComp __attribute__ ((aligned (8)));
#define fCIni(d,X,Y) do { (d)->x = (X); (d)->y = (Y); } while(0)

int main(int nArgs, char *Args[], char *Env[])
{
    fComp x,y,z;

    /* x,y random complex numbers, z = x*y */
    fCIni(&x,fRand(),fRand());
    fCIni(&y,fRand(),fRand());
    fCIni(&z,x.x*y.x - x.y*y.y, x.x*y.y + x.y*y.x);
    printf("x = %f + %fi\ty = %f + %fi\nz = %f + %fi\t",x.x,x.y,y.x,y.y,z.x,z.y);

    /* Recalculate the same product using 3dNow */
    asm("pswapd %[X],%[X]\n\t"      /* X = r1 | i1 */
        "pfmul %[Y],%[Z]\n\t"       /* Z = i1i2 | r1r2 */
        "pfmul %[X],%[Y]\n\t"       /* Y = r1i2 | r2i1 */
        "pfpnacc %[Y],%[Z]\n\t"     /* Z = r1i2+r2i1 | r1r2-i1i2 */
        "movq %[Z],(%[Zp])"         /* Store result at *Zp */
        :
        : [X]"y"(x), [Y]"y"(y), [Z]"y"(x), [Zp]"r"(&z)
        : "memory" );
    asm("femms");
    printf("z = %f + %fi\n", z.x, z.y);
    return 0;
}
----

Inspecting the .s output, the unoptimized compilation assigns [X] = %mm0, [Y] = %mm1, [Z] = %mm2, while the optimized version fails to work because of the duplication [X] = %mm0, [Y] = %mm1, [Z] = %mm0.

This seems to happen because both X and Z are initialized to the same value; doing [Z]"y"(0) or something similar (and movq %[X],%[Z] manually) resolves the problem, but with an unfortunate side-effect that %[Z] is actually set twice, wasting instructions. It would be saner for the compiler to just automatically re-copy the first instance to the second one, rather than erroneously concluding that I don't really need the extra register just because it has the same input value.
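For comparison, here is a minimal sketch of the same situation using general-purpose registers instead of MMX, so that it runs on any x86 machine. The function name mul_add and the integer arithmetic are hypothetical stand-ins, not part of cdtest.c. As far as I understand the extended asm documentation, input-only operands are assumed to be left unmodified, so two inputs holding the same value may share a register; declaring the modified operand as read-write with an early-clobber ("+&r") should force it into a register of its own. I have not verified that the same constraints cure the 3dNow case above.

```c
#include <assert.h>

/* Hypothetical helper: returns a*b + a.
 * ACC starts with the same value as A, like [Z] and [X] in cdtest.c,
 * but it is declared "+&r" (read-write, early-clobber) because the asm
 * overwrites it before the last read of A. */
unsigned mul_add(unsigned a, unsigned b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned acc = a;
    __asm__("imull %[B], %[ACC]\n\t"   /* ACC = a*b     */
            "addl  %[A], %[ACC]"       /* ACC = a*b + a */
            : [ACC] "+&r"(acc)         /* modified operand: own register */
            : [A] "r"(a), [B] "r"(b)   /* true inputs, left untouched    */
            : "cc");
    return acc;
#else
    /* Portable fallback for non-x86 targets (assumption: same result). */
    return a * b + a;
#endif
}
```

Calling mul_add(3, 4) should give 15 at every optimization level; the point of the sketch is only that the operand being overwritten is no longer declared as a plain input.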
----
Using built-in specs.
Target: djgpp
Configured with: /v203/gcc-4.21/configure djgpp --prefix=/dev/env/DJDIR --disable-nls --disable-werror --enable-languages=c,c++,fortran,objc,obj-c++,ada
Thread model: single
gcc version 4.2.1
-----
Reading specs from /usr/lib/gcc/i686-pc-cygwin/3.4.4/specs
Configured with: /usr/build/package/orig/test.respin/gcc-3.4.4-3/configure --verbose --prefix=/usr --exec-prefix=/usr --sysconfdir=/etc --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --enable-languages=c,ada,c++,d,f77,pascal,java,objc --enable-nls --without-included-gettext --enable-version-specific-runtime-libs --without-x --enable-libgcj --disable-java-awt --with-system-zlib --enable-interpreter --disable-libgcj-debug --enable-threads=posix --enable-java-gc=boehm --disable-win32-registry --enable-sjlj-exceptions --enable-hash-synchronization --enable-libstdcxx-debug
Thread model: posix
gcc version 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)
----

A similar problem occurs if I replace the first inline asm with:

    asm("movq %[X],%[Z]\n\t"
        "pswapd %[X],%[X]\n\t"      /* X = r1 | i1 */
        "pfmul %[Y],%[Z]\n\t"       /* Z = i1i2 | r1r2 */
        "pfmul %[X],%[Y]\n\t"       /* Y = r1i2 | r2i1 */
        "pfpnacc %[Y],%[Z]\n\t"     /* Z = r1i2+r2i1 | r1r2-i1i2 */
        : [Z]"=y"(z)
        : [X]"y"(x), [Y]"y"(y) );

with both [Z] and [X] again being mapped to the same register. (This code is also even more wasteful, but that's not the point.)

----
BTW: Is there any way to tell gcc that I need a random register different from the rest, but I don't care about its initial value--something like [foo]"y"() in inputs or [bar]"y" in clobber? I know if I put, say, "mm0" in the clobber list, then gcc seems to know to pick different ones for [X],[Y],[Z], but letting gcc pick one would be a convenient feature.
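One approach to the scratch-register question that I believe is standard is a dummy output operand marked early-clobber ("=&r", or presumably "=&y" for MMX): the compiler picks the register, guarantees it does not overlap any input, and the C variable bound to it is simply never read. The sketch below assumes that reading of the constraint documentation; it uses general-purpose registers so it runs on any x86 machine, and the names sum_of_squares, result and tmp are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: returns a*a + b*b.
 * T is a compiler-chosen scratch register: a dummy early-clobber output
 * whose initial value is irrelevant and whose final value is discarded. */
unsigned sum_of_squares(unsigned a, unsigned b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned result;
    unsigned tmp;                      /* scratch only, never read in C */
    __asm__("movl  %[A], %[R]\n\t"     /* R = a         */
            "imull %[A], %[R]\n\t"     /* R = a*a       */
            "movl  %[B], %[T]\n\t"     /* T = b         */
            "imull %[B], %[T]\n\t"     /* T = b*b       */
            "addl  %[T], %[R]"         /* R = a*a + b*b */
            : [R] "=&r"(result),       /* written before B is read       */
              [T] "=&r"(tmp)           /* scratch, distinct from inputs  */
            : [A] "r"(a), [B] "r"(b)
            : "cc");
    (void)tmp;
    return result;
#else
    /* Portable fallback for non-x86 targets (assumption: same result). */
    return a * a + b * b;
#endif
}
```

Compared with naming "mm0" in the clobber list, this leaves the choice of register to the allocator, which is the convenience asked about above.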
--
           Summary: Inline assembly swallows mmx register on duplicate input under optimization
           Product: gcc
           Version: 4.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: inline-asm
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: stanlio at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38901