GCC version: powerpc-apple-darwin8.11.0-gcc-4.4.0 (GCC) 4.4.0 20090116 (experimental)

This is a MacPorts (formerly DarwinPorts) build of GCC 4.4.0 on an OS X 10.4.11 ppc7450 host.
The following C++ function produces different code depending on whether the 'loop_Ai' or the 'direct_assignment_Ai' snippet is used:

    float a[4][4] __attribute__ ((aligned (16)));
    float b[4][4] __attribute__ ((aligned (16)));
    float c[4][4] __attribute__ ((aligned (16)));

    inline static void
    mmul(
        float (&c)[4][4],
        const float (&a)[4][4],
        const float (&b)[4][4])
    {
        // iterate by product's rows
        for (unsigned i = 0; i < 4; i++)
        {
            register float ai[4][4];

            // swizzle each element of the i-th row of A into a full-dimensional vector
            for (unsigned j = 0; j < 4; j++)
                // direct_assignment_Ai:
                /* ai[j][0] = ai[j][1] = ai[j][2] = ai[j][3] = a[i][j]; */
                // loop_Ai:
                for (unsigned k = 0; k < 4; k++)
                    ai[j][k] = a[i][j];

            // multiply the first element of the i-th row of A by the first row of B
            for (unsigned k = 0; k < 4; k++)
            {
                c[i][k] = ai[0][k] * b[0][k];
            }

            // multiply-add all subsequent elements of the i-th row of A by their respective rows of B
            for (unsigned j = 1; j < 4; j++)
            {
                for (unsigned k = 0; k < 4; k++)
                {
                    c[i][k] += ai[j][k] * b[j][k];
                }
            }
        }
    }

I observed a ~10% performance degradation when using 'loop_Ai' instead of 'direct_assignment_Ai'. From what I can tell, the differences in the generated PPC code are mainly in instruction scheduling.

The following optimization-related compiler options were used for the test:

    -fno-exceptions -fno-rtti -faltivec -maltivec -mtune=7450 -O3 -fmessage-length=0 -funroll-loops -ffast-math -fstrict-aliasing -ftree-vectorize -ftree-vectorizer-verbose=3 -fvisibility=hidden -fvisibility-inlines-hidden -fno-threadsafe-statics

Full test app code and the resulting .s files are available upon request.

For the record, the intended vectorization fails, so the resulting code is entirely scalar, but it is rich in fused multiply-adds.

-martin
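For anyone wanting to compare the two variants side by side, the following is a minimal sketch and not the original test app. It makes several assumptions: the `Splat` enum, the template parameter, and the helpers `variants_agree` and `seconds_for` are hypothetical names introduced here; the `register` keyword is dropped so the code also builds under newer C++ standards; and selecting the variant through a template rather than by editing the source may itself change the generated code, so any timings should be treated as indicative only.

```cpp
#include <cassert>
#include <cstring>
#include <ctime>

// Hypothetical selector for how a[i][j] is copied into ai[j][0..3].
enum Splat { DIRECT_ASSIGNMENT, LOOP };

template <Splat S>
static void mmul(float (&c)[4][4], const float (&a)[4][4], const float (&b)[4][4])
{
    // iterate by product's rows
    for (unsigned i = 0; i < 4; i++) {
        float ai[4][4]; // 'register' from the report omitted (assumption)

        // swizzle each element of the i-th row of A into a full vector
        for (unsigned j = 0; j < 4; j++) {
            if (S == DIRECT_ASSIGNMENT) {
                ai[j][0] = ai[j][1] = ai[j][2] = ai[j][3] = a[i][j];
            } else {
                for (unsigned k = 0; k < 4; k++)
                    ai[j][k] = a[i][j];
            }
        }

        // first element of the row times the first row of B
        for (unsigned k = 0; k < 4; k++)
            c[i][k] = ai[0][k] * b[0][k];

        // multiply-add the remaining elements by their rows of B
        for (unsigned j = 1; j < 4; j++)
            for (unsigned k = 0; k < 4; k++)
                c[i][k] += ai[j][k] * b[j][k];
    }
}

// Returns true if both variants give bit-identical products. The inputs are
// small integers, so every intermediate value is exactly representable.
static bool variants_agree()
{
    float a[4][4], b[4][4], c_direct[4][4], c_loop[4][4];
    for (unsigned i = 0; i < 4; i++)
        for (unsigned j = 0; j < 4; j++) {
            a[i][j] = float(i * 4 + j + 1);
            b[i][j] = float((i + 2 * j) % 5);
        }
    mmul<DIRECT_ASSIGNMENT>(c_direct, a, b);
    mmul<LOOP>(c_loop, a, b);
    return std::memcmp(c_direct, c_loop, sizeof c_direct) == 0;
}

// Rough CPU-time measurement of one variant, in seconds.
template <Splat S>
static double seconds_for(unsigned iterations)
{
    assert(iterations > 0);
    static float a[4][4], b[4][4], c[4][4];
    for (unsigned i = 0; i < 4; i++)
        for (unsigned j = 0; j < 4; j++) {
            a[i][j] = float(i + j);
            b[i][j] = float(i * j + 1);
        }

    volatile float sink = 0.0f;
    std::clock_t start = std::clock();
    for (unsigned n = 0; n < iterations; n++) {
        a[0][0] = float(n & 7);  // vary the input so the call is not hoisted
        mmul<S>(c, a, b);
        sink = c[n & 3][n & 3];  // keep the result observable
    }
    std::clock_t stop = std::clock();
    (void)sink;
    return double(stop - start) / CLOCKS_PER_SEC;
}
```

A driver would call `variants_agree()` once as a sanity check and then compare `seconds_for<DIRECT_ASSIGNMENT>(n)` against `seconds_for<LOOP>(n)` for a large `n`, built with the options listed above. Diffing the `.s` output for the two instantiations is the more reliable way to see the scheduling differences described in the report.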