Consider the following SSE2 code, compiled with
-march=pentium4 -mtune=prescott -O2 -mfpmath=sse -msse2:
#include <emmintrin.h>

__m128i foo3(__m128i z, __m128i a, int N) {
  int i;
  for (i = 0; i < N; i++) {
    a = _mm_add_epi64(z, a);
  }
  return _mm_add_epi64(a, a);
}

__m128i foo1(__m128i z, __m128i a, int N) {
  int i;
  for (i = 0; i < N; i++) {
    a = _mm_add_epi16(z, a);
  }
  return _mm_add_epi16(a, a);
}
The first inner loop compiles to

	paddq	%xmm0, %xmm1

Good. The second compiles to

	movdqa	%xmm2, %xmm0
	paddw	%xmm1, %xmm0
	movdqa	%xmm0, %xmm1
when a single paddw would do. The basic problem is that our approach
defines __m128i to be V2DI even though all the operations on the object
here are V8HI, so there are a lot of subregs that don't need to
generate code. I'd like to fix this, but am not sure how to go
about it.
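Concretely, with __m128i carried in V2DI, the V8HI add in the second loop reaches RTL looking roughly like this (a hand-written illustration with made-up pseudo-register numbers, not an actual -fdump-rtl dump):

```
(set (subreg:V8HI (reg:V2DI 61) 0)
     (plus:V8HI (subreg:V8HI (reg:V2DI 59) 0)
                (subreg:V8HI (reg:V2DI 60) 0)))
```

Each subreg is a bit-for-bit reinterpretation that should be free, but the mode change acts as a barrier to the optimizers and the movdqa copies get materialized anyway.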
The pattern-matching and RTL optimizers seem quite hostile to
mismatched-mode operations. If I were starting from scratch, I'd define
a single V128I mode and distinguish paddw and paddq by operation codes,
or possibly by using subreg:SSEMODEI throughout the patterns. Any less
intrusive ideas?
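For what it's worth, the subreg:SSEMODEI variant might look something like the following for the integer adds. This is a hand-written sketch against a hypothetical single V128I container mode (V128I, and the exact operand wrapping, are my invention; "padd<ssevecsize>" is meant in the spirit of the existing mode-attribute tricks in sse.md), not a tested patch:

```
;; Hypothetical: all operands live in one 128-bit container mode,
;; and each pattern projects out the element mode it needs via subregs,
;; so no real mode change ever separates the operations.
(define_insn "add<mode>3"
  [(set (subreg:SSEMODEI (match_operand:V128I 0 "register_operand" "=x") 0)
        (plus:SSEMODEI
          (subreg:SSEMODEI (match_operand:V128I 1 "register_operand" "%0") 0)
          (subreg:SSEMODEI (match_operand:V128I 2 "nonimmediate_operand" "xm") 0)))]
  "TARGET_SSE2"
  "padd<ssevecsize>\t{%2, %0|%0, %2}")
```

Whether combine and the register allocator would tolerate subregs in the set destinations like this is exactly the open question.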
Thanks.
(ISTR some earlier discussion about this but can't find it; apologies if
I'm reopening something that shouldn't be:)