Consider the following SSE code (compiled with
-march=pentium4 -mtune=prescott -O2 -mfpmath=sse -msse2):
#include <emmintrin.h>
__m128i foo3(__m128i z, __m128i a, int N) {
        int i;
        for (i=0; i<N; i++) {
                a = _mm_add_epi64(z, a);
        }
        return _mm_add_epi64(a, a);
}
__m128i foo1(__m128i z, __m128i a, int N) {
        int i;
        for (i=0; i<N; i++) {
                a = _mm_add_epi16(z, a);
        }
        return _mm_add_epi16(a, a);
}


The first inner loop compiles to

        paddq   %xmm0, %xmm1

Good.  The second compiles to

        movdqa  %xmm2, %xmm0
        paddw   %xmm1, %xmm0
        movdqa  %xmm0, %xmm1

when it could be using a single paddw.  The basic problem is that
our approach defines __m128i to be V2DI even though all the operations
on the object here are in V8HI mode, so there are a lot of subregs that
don't need to generate code.  I'd like to fix this, but am not sure how
to go about it.
The pattern-matching and RTL optimizers seem quite hostile to mismatched
mode operations. If I were starting from scratch I'd define a single V128I mode
and distinguish paddw and paddq by operation codes, or possibly by using
subreg:SSEMODEI throughout the patterns. Any less intrusive ideas? Thanks.

(ISTR some earlier discussion about this but can't find it; apologies if
I'm reopening something that shouldn't be. :)
