I've tested this under gcc 4.2.1 (DJGPP) and gcc 3.4.4 (cygwin) (see below for
gcc -v outputs). The test program is simple: generate two random complex
numbers, multiply them using the FPU, and do it again using 3dnow inline
assembly. I can get it to work with different code, but this version
illustrates an apparent bug in the optimization scheme.

Compiling as gcc -m3dnow cdtest.c produces the correct output:
x = 0.122114 + 0.096877i        y = 0.372626 + 0.199503i
z = 0.026175 + 0.060461i        z = 0.026175 + 0.060461i

Compiling as above with an additional -O1, -O2, or -O3 produces:
x = 0.122114 + 0.096877i        y = 0.372626 + 0.199503i
z = 0.026175 + 0.060461i        z = 0.011737 + 0.018312i
with the inline-assembly (second) value wrong.

--- cdtest.c
#include <stdio.h>
#include <stdlib.h>

#define fRand()    (((float)rand())/RAND_MAX)
typedef struct fComp { float x,y; } fComp __attribute__ ((aligned (8)));
#define fCIni(d,X,Y)   do { (d)->x = (X); (d)->y = (Y); } while(0)

int main(int nArgs, char *Args[], char *Env[]) {
  fComp x,y,z; /* x,y random complex numbers, z = x*y */
  fCIni(&x,fRand(),fRand()); fCIni(&y,fRand(),fRand());
  fCIni(&z,x.x*y.x - x.y*y.y, x.x*y.y + x.y*y.x);
  printf("x = %f + %fi\ty = %f + %fi\nz = %f + %fi\t",x.x,x.y,y.x,y.y,z.x,z.y);
  /* Recalculate the same product using 3dNow */
  asm("pswapd   %[X],%[X]\n\t"    /* X = r1   |   i1 */
      "pfmul    %[Y],%[Z]\n\t"    /* Z = i1i2 | r1r2 */
      "pfmul    %[X],%[Y]\n\t"    /* Y = r1i2 | r2i1 */
      "pfpnacc  %[Y],%[Z]\n\t"    /* Z = r1i2+r2i1 | r1r2-i1i2 */
      "movq     %[Z],(%[Zp])"     /* Store result at *Zp */
      :
      : [X]"y"(x), [Y]"y"(y), [Z]"y"(x), [Zp]"r"(&z)
      : "memory" );
  asm("femms");
  printf("z = %f + %fi\n", z.x, z.y);
  return 0;
}
----

Inspecting the .s output, the unoptimized compilation assigns
[X] = %mm0, [Y] = %mm1, [Z] = %mm2,
while the optimized version fails to work because of the duplication
[X] = %mm0, [Y] = %mm1, [Z] = %mm0.
This seems to happen because both X and Z are initialized to the same value;
doing [Z]"y"(0) or something similar (and movq %[X],%[Z] manually) resolves the
problem, but with an unfortunate side-effect that %[Z] is actually set twice,
wasting instructions. It would be saner for the compiler to just automatically
re-copy the first instance to the second one, rather than erroneously
concluding that I don't really need the extra register just because it has the
same input value.
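
A possible explanation, going by the rule in the GCC manual that an asm must
not modify its input-only operands: [X], [Y], and [Z] are all declared as
inputs yet are written by the asm, so the compiler may give inputs that hold
the same value a single register. Below is a minimal sketch of the alternative,
declaring every operand the asm overwrites as read-write with early clobber
("+&r"). It is only an analogue: it uses general-purpose registers and integer
arithmetic instead of 3dNow so that it runs on any x86 machine, the helper name
mul_sub is hypothetical, and the plain C branch is a fallback for non-x86
targets.

```c
#include <stdio.h>

/* Hypothetical helper: returns x*y - x.
 * xs and z both start out equal to x, mirroring [X] and [Z] in cdtest.c,
 * but each is a read-write operand, so each gets its own register. */
static int mul_sub(int x, int y)
{
    int xs = x;   /* overwritten by the asm */
    int z  = x;   /* same initial value as xs, still a distinct operand */
#if defined(__i386__) || defined(__x86_64__)
    __asm__("negl  %[X]\n\t"         /* X = -x                        */
            "imull %[Y], %[Z]\n\t"   /* Z = x*y                       */
            "addl  %[X], %[Z]"       /* Z = x*y - x                   */
            : [X] "+&r"(xs), [Z] "+&r"(z)  /* written before Y is read */
            : [Y] "r"(y)                   /* read-only, never written  */
            : "cc");
#else
    /* Portable fallback so the sketch still builds on other targets. */
    xs = -xs;
    z = z * y + xs;
#endif
    return z;
}
```

The early-clobber "&" matters here because X is overwritten before Y is read;
without it the compiler could still place an equal-valued Y in the same
register as X. The 3dNow analogue would presumably use "+&y" on the operands
that the asm overwrites; that variant is untested here.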

----
Using built-in specs.
Target: djgpp
Configured with: /v203/gcc-4.21/configure djgpp --prefix=/dev/env/DJDIR
--disable-nls --disable-werror --enable-languages=c,c++,fortran,objc,obj-c++,ada
Thread model: single
gcc version 4.2.1
-----
Reading specs from /usr/lib/gcc/i686-pc-cygwin/3.4.4/specs
Configured with: /usr/build/package/orig/test.respin/gcc-3.4.4-3/configure
--verbose --prefix=/usr --exec-prefix=/usr --sysconfdir=/etc --libdir=/usr/lib
--libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info
--enable-languages=c,ada,c++,d,f77,pascal,java,objc --enable-nls
--without-included-gettext --enable-version-specific-runtime-libs --without-x
--enable-libgcj --disable-java-awt --with-system-zlib --enable-interpreter
--disable-libgcj-debug --enable-threads=posix --enable-java-gc=boehm
--disable-win32-registry --enable-sjlj-exceptions --enable-hash-synchronization
--enable-libstdcxx-debug
Thread model: posix
gcc version 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)
----

A similar problem occurs if I replace the first inline asm with:
  asm("movq     %[X],%[Z]\n\t"
      "pswapd   %[X],%[X]\n\t"    /* X = r1   |   i1 */
      "pfmul    %[Y],%[Z]\n\t"    /* Z = i1i2 | r1r2 */
      "pfmul    %[X],%[Y]\n\t"    /* Y = r1i2 | r2i1 */
      "pfpnacc  %[Y],%[Z]\n\t"    /* Z = r1i2+r2i1 | r1r2-i1i2 */
      : [Z]"=y"(z)
      : [X]"y"(x), [Y]"y"(y) );
with both [Z] and [X] again being mapped to the same register. (This code is
also even more wasteful, but that's not the point.)

----
BTW: Is there any way to tell gcc that I need a random register different from
the rest, but I don't care about its initial value--something like [foo]"y"()
in inputs or [bar]"y" in clobber? I know if I put, say "mm0" in the clobber
list, then gcc seems to know to pick different ones for [X],[Y],[Z], but
letting gcc pick one would be a convenient feature.
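
One idiom that may cover this, sketched on the assumption that it fits the case
above: declare a dummy output operand with the early-clobber modifier ("=&r",
or presumably "=&y" for an MMX register). The compiler then picks a free
register that is distinct from every input, and the variable's initial value
does not matter. The sketch uses general-purpose registers and integer
arithmetic so that it runs on any x86 machine; the helper name mul_sum is
hypothetical, and the plain C branch is a fallback for non-x86 targets.

```c
#include <stdio.h>

/* Hypothetical helper: returns a*b + a*c, using one scratch register
 * that the compiler chooses. */
static int mul_sum(int a, int b, int c)
{
    int result;
    int scratch;  /* never read before being written; only the register matters */
#if defined(__i386__) || defined(__x86_64__)
    __asm__("movl  %[A], %[T]\n\t"   /* T = a           */
            "imull %[B], %[T]\n\t"   /* T = a*b         */
            "movl  %[A], %[R]\n\t"   /* R = a           */
            "imull %[C], %[R]\n\t"   /* R = a*c         */
            "addl  %[T], %[R]"       /* R = a*b + a*c   */
            : [R] "=&r"(result), [T] "=&r"(scratch)  /* distinct from inputs */
            : [A] "r"(a), [B] "r"(b), [C] "r"(c)     /* inputs stay unmodified */
            : "cc");
#else
    /* Portable fallback so the sketch still builds on other targets. */
    scratch = a * b;
    result = a * c + scratch;
#endif
    return result;
}
```

Because both outputs are early-clobber, neither can share a register with A,
B, or C, even when the inputs hold equal values.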


-- 
           Summary: Inline assembly swallows mmx register on duplicate input
                    under optimization
           Product: gcc
           Version: 4.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: inline-asm
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: stanlio at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38901
