Hi Diego,
Jeff's point about our optimizers is also true. Nick, remember that
issue with MIPS optimizations you were discussing with Jeff a few days
ago? I didn't follow most of the details, but it involved ivopts and
sign issues. Could you send a summary?
Sure:
I was looking at how a gcc 4.1 based compiler optimized a fast
fourier transform algorithm for the MIPS target. What I found was the
it was producing much worse code than a similarly configured gcc 3.4
based compiler, and the culprit was the creation of the induction
variables used in the inner loop.
The nested loops look like this:
signed short i,j,k,DataSizeExponent,DataSize;
signed short RealBitRevData[MAX], ImagBitRevData[MAX];
signed short * CosineV, * SineV;
for (k = 1; k <= DataSizeExponent; k++)
{
signed short n1 = 1 << k;
signed short n2 = n1 >> 1;
signed short ArgIndex = 0;
signed short DeltaIndex = (DataSize >> 1) / n2;
for (j = 0; j < n2; j++)
{
signed int WReal = CosineV[ArgIndex];
signed int WImag = SineV[ArgIndex];
ArgIndex += DeltaIndex;
for (i = j; i < DataSize; i += n1)
{
signed short l = i + n2;
signed int tRealData = (WReal * RealBitRevData[l]) +
(WImag * ImagBitRevData[l]);
signed int tImagData = (WReal * ImagBitRevData[l]) -
(WImag * RealBitRevData[l]);
tRealData = tRealData >> BUTTERFLY_SCALE_FACTOR;
tImagData = tImagData >> BUTTERFLY_SCALE_FACTOR;
RealBitRevData[l] = RealBitRevData[i] - tRealData;
ImagBitRevData[l] = ImagBitRevData[i] - tImagData;
RealBitRevData[i] += tRealData;
ImagBitRevData[i] += tImagData;
}
}
}
and the inner loop was being transformed into:
short int D.2046, D.2043;
long int D.2038, D.2035;
int D.2033, D.2024, D.2020, D.2008, D.2001;
[...]
<L5>:;
D.2033 = (int) (signed short) ivtmp.67;
D.2035 = (long int) RealBitRevData[D.2033];
D.2038 = (long int) ImagBitRevData[D.2033];
tRealData = WReal * D.2035 + WImag * D.2038;
tImagData = WReal * D.2038 - WImag * D.2035;
D.2001 = (int) i;
D.2043 = (short int) (tRealData >> 15);
RealBitRevData[D.2033] = RealBitRevData[D.2001] - D.2043;
D.2046 = (short int) (tImagData >> 15);
ImagBitRevData[D.2033] = ImagBitRevData[D.2001] - D.2046;
RealBitRevData[D.2001] = D.2043 + RealBitRevData[D.2001];
ImagBitRevData[D.2001] = D.2046 + ImagBitRevData[D.2001];
i = (signed short) ((short unsigned int) i + ivtmp.78);
ivtmp.68 = ivtmp.68 + ivtmp.78;
ivtmp.67 = ivtmp.67 + ivtmp.78;
if (DataSize > (signed short) (ivtmp.68 - ivtmp.78)) goto <L5>; else
goto <L7>;
The net result of these induction variables, and especially the type
translations necessary to go between signed and unsigned and shorts and
ints was an inner loop consisting of 42 instructions, as opposed to an
inner loop of 32 instructions as produced by the gcc 3.4 compiler.
Cheers
Nick