Hi Diego,

Jeff's point about our optimizers is also true.  Nick, remember that
issue with MIPS optimizations you were discussing with Jeff a few days
ago?  I didn't follow most of the details, but it involved ivopts and
sign issues.  Could you send a summary?

Sure:

I was looking at how a gcc 4.1 based compiler optimized a fast fourier transform algorithm for the MIPS target. What I found was the it was producing much worse code than a similarly configured gcc 3.4 based compiler, and the culprit was the creation of the induction variables used in the inner loop.

The nested loops look like this:

    signed short i,j,k,DataSizeExponent,DataSize;
    signed short RealBitRevData[MAX], ImagBitRevData[MAX];
    signed short * CosineV, * SineV;

    for (k = 1; k <= DataSizeExponent; k++)
    {
        signed short n1 = 1 << k;
        signed short n2 = n1 >> 1;
        signed short ArgIndex = 0;
        signed short DeltaIndex = (DataSize >> 1) / n2;

        for (j = 0; j < n2; j++)
        {
            signed int WReal = CosineV[ArgIndex];
            signed int WImag = SineV[ArgIndex];

            ArgIndex += DeltaIndex;
            for (i = j; i < DataSize; i += n1)
            {
                signed short l = i + n2;
signed int tRealData = (WReal * RealBitRevData[l]) + (WImag * ImagBitRevData[l]); signed int tImagData = (WReal * ImagBitRevData[l]) - (WImag * RealBitRevData[l]);

                tRealData = tRealData >> BUTTERFLY_SCALE_FACTOR;
                tImagData = tImagData >> BUTTERFLY_SCALE_FACTOR;
                RealBitRevData[l] = RealBitRevData[i] - tRealData;
                ImagBitRevData[l] = ImagBitRevData[i] - tImagData;
                RealBitRevData[i] += tRealData;
                ImagBitRevData[i] += tImagData;
            }
        }
    }

and the inner loop was being transformed into:

  short int D.2046, D.2043;
  long int D.2038, D.2035;
  int D.2033, D.2024,  D.2020, D.2008, D.2001;
   [...]

<L5>:;
  D.2033 = (int) (signed short) ivtmp.67;
  D.2035 = (long int) RealBitRevData[D.2033];
  D.2038 = (long int) ImagBitRevData[D.2033];
  tRealData = WReal * D.2035 + WImag * D.2038;
  tImagData = WReal * D.2038 - WImag * D.2035;
  D.2001 = (int) i;
  D.2043 = (short int) (tRealData >> 15);
  RealBitRevData[D.2033] = RealBitRevData[D.2001] - D.2043;
  D.2046 = (short int) (tImagData >> 15);
  ImagBitRevData[D.2033] = ImagBitRevData[D.2001] - D.2046;
  RealBitRevData[D.2001] = D.2043 + RealBitRevData[D.2001];
  ImagBitRevData[D.2001] = D.2046 + ImagBitRevData[D.2001];
  i = (signed short) ((short unsigned int) i + ivtmp.78);
  ivtmp.68 = ivtmp.68 + ivtmp.78;
  ivtmp.67 = ivtmp.67 + ivtmp.78;
if (DataSize > (signed short) (ivtmp.68 - ivtmp.78)) goto <L5>; else goto <L7>;


The net result of these induction variables, and especially the type translations necessary to go between signed and unsigned and shorts and ints was an inner loop consisting of 42 instructions, as opposed to an inner loop of 32 instructions as produced by the gcc 3.4 compiler.

Cheers
  Nick

Reply via email to