gcc 4.4.0 loop-unrolling optimizations peculiarity observed

martin krastev Wed, 28 Jan 2009 20:41:08 -0800

gcc version: powerpc-apple-darwin8.11.0-gcc-4.4.0 (GCC) 4.4.0 20090116
(experimental)
This is a MacPorts (formerly DarwinPorts) build of gcc 4.4.0 on an
OS X 10.4.11 ppc7450 host.


The following C++ function compiles to different code depending on
whether the 'loop_Ai' or the 'direct_assignment_Ai' snippet is used:

float a[4][4] __attribute__ ((aligned (16)));
float b[4][4] __attribute__ ((aligned (16)));

float c[4][4] __attribute__ ((aligned (16)));

inline static void
mmul(
    float (&c)[4][4],
    const float (&a)[4][4],
    const float (&b)[4][4])
{
    // iterate by product's rows
    for (unsigned i = 0; i < 4; i++)
    {
        register float ai[4][4];

        // swizzle each element of the i-th row of A into a full-dimensional vector
        for (unsigned j = 0; j < 4; j++)

// direct_assignment_Ai:
/*          ai[j][0] = ai[j][1] = ai[j][2] = ai[j][3] = a[i][j];
*/
// loop_Ai:
            for (unsigned k = 0; k < 4; k++)
                ai[j][k] = a[i][j];

        // multiply the first element of the i-th row of A by the first row of B
        for (unsigned k = 0; k < 4; k++)
        {
            c[i][k] = ai[0][k] * b[0][k];
        }

        // multiply-add all subsequent elements of the i-th row of A by their respective rows of B
        for (unsigned j = 1; j < 4; j++)
        {
            for (unsigned k = 0; k < 4; k++)
            {
                c[i][k] += ai[j][k] * b[j][k];
            }
        }
    }
}


I observed a ~10% performance degradation when using 'loop_Ai' instead
of 'direct_assignment_Ai'. From what I can tell, the generated ppc code
for the two variants differs mainly in instruction scheduling.
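
For anyone who wants to compare the two snippets side by side, here is a minimal sketch, not the actual test app. It assumes the variants can be selected with a template parameter instead of commenting code in and out. The names Swizzle, mmul_variant, variants_agree and seconds_for are hypothetical and introduced only for this illustration, and the 'register' keyword is dropped. Nothing guarantees that this form reproduces the scheduling difference described above.

```cpp
#include <cmath>
#include <ctime>

// Hypothetical selector for the two swizzle snippets from the post.
enum Swizzle { LOOP_AI, DIRECT_ASSIGNMENT_AI };

template <Swizzle S>
static void mmul_variant(
    float (&c)[4][4],
    const float (&a)[4][4],
    const float (&b)[4][4])
{
    for (unsigned i = 0; i < 4; i++)
    {
        float ai[4][4];

        // broadcast a[i][j] across row j of ai, using the selected snippet
        for (unsigned j = 0; j < 4; j++)
        {
            if (S == DIRECT_ASSIGNMENT_AI)
                ai[j][0] = ai[j][1] = ai[j][2] = ai[j][3] = a[i][j];
            else
                for (unsigned k = 0; k < 4; k++)
                    ai[j][k] = a[i][j];
        }

        // first term of row i of the product
        for (unsigned k = 0; k < 4; k++)
            c[i][k] = ai[0][k] * b[0][k];

        // accumulate the remaining three terms
        for (unsigned j = 1; j < 4; j++)
            for (unsigned k = 0; k < 4; k++)
                c[i][k] += ai[j][k] * b[j][k];
    }
}

// Returns true if both variants give the same product within a small
// relative tolerance (options such as -ffast-math may change rounding).
static bool variants_agree(const float (&a)[4][4], const float (&b)[4][4])
{
    float c_loop[4][4];
    float c_direct[4][4];
    mmul_variant<LOOP_AI>(c_loop, a, b);
    mmul_variant<DIRECT_ASSIGNMENT_AI>(c_direct, a, b);

    for (unsigned i = 0; i < 4; i++)
        for (unsigned k = 0; k < 4; k++)
        {
            const float scale = 1.0f + std::fabs(c_loop[i][k]);
            if (std::fabs(c_loop[i][k] - c_direct[i][k]) > 1e-5f * scale)
                return false;
        }
    return true;
}

// Rough CPU-time measurement of one variant, in seconds. The volatile
// accesses are an attempt to stop the optimizer from hoisting or
// discarding the multiplication; they do not make this a rigorous benchmark.
template <Swizzle S>
static double seconds_for(unsigned long iterations,
                          const float (&a)[4][4], const float (&b)[4][4])
{
    float a_local[4][4];
    float c[4][4];
    for (unsigned i = 0; i < 4; i++)
        for (unsigned j = 0; j < 4; j++)
            a_local[i][j] = a[i][j];

    volatile float seed = a[0][0];
    volatile float sink = 0.0f;

    const std::clock_t start = std::clock();
    for (unsigned long n = 0; n < iterations; n++)
    {
        a_local[0][0] = seed;           // volatile read on every pass
        mmul_variant<S>(c, a_local, b);
        sink = c[n & 3][n & 3];         // volatile write on every pass
    }
    const std::clock_t stop = std::clock();

    return double(stop - start) / CLOCKS_PER_SEC;
}
```

Calling seconds_for<LOOP_AI>(n, a, b) and seconds_for<DIRECT_ASSIGNMENT_AI>(n, a, b) with the same inputs and a large n gives two numbers to compare. std::clock has coarse resolution, so treat small differences with caution.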

The following optimization-related compiler options were used for the test:

-fno-exceptions -fno-rtti -faltivec -maltivec -mtune=7450 -O3
-fmessage-length=0 -funroll-loops -ffast-math -fstrict-aliasing
-ftree-vectorize -ftree-vectorizer-verbose=3 -fvisibility=hidden
-fvisibility-inlines-hidden -fno-threadsafe-statics

The full test app code and the resulting .s files are available upon
request. For the record, the intended vectorization fails, so the
resulting code is entirely scalar, but it is rich in fused multiply-adds.

-martin
