Consider the following SSE2 code, compiled with
-march=pentium4 -mtune=prescott -O2 -mfpmath=sse -msse2:
#include <emmintrin.h>

__m128i foo3(__m128i z, __m128i a, int N) {
  int i;
  for (i = 0; i < N; i++) {
    a = _mm_add_epi64(z, a);
  }
  return _mm_add_epi64(a, a);
}

__m128i foo1(__m128i z, __m128i a, int N) {
  int i;
  for (i = 0; i < N; i++) {
    a = _mm_add_epi16(z, a);
  }
  return _mm_add_epi16(a, a);
}
The first inner loop compiles to

	paddq	%xmm0, %xmm1

Good. The second compiles to

	movdqa	%xmm2, %xmm0
	paddw	%xmm1, %xmm0
	movdqa	%xmm0, %xmm1
when a single paddw would do. The basic problem is that our approach
defines __m128i to be V2DI even though all the operations on the object
here are V8HI, so there are a lot of subregs that don't need to
generate code. I'd like to fix this, but am not sure how to go
about it.
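Concretely, with __m128i carried in V2DI, the V8HI add in the second loop reaches RTL looking roughly like this (a hand-written illustration with made-up pseudo-register numbers, not an actual -fdump-rtl dump):

```
(set (subreg:V8HI (reg:V2DI 61) 0)
     (plus:V8HI (subreg:V8HI (reg:V2DI 59) 0)
                (subreg:V8HI (reg:V2DI 60) 0)))
```

Each subreg is a bit-for-bit reinterpretation that should be free, but the mode change acts as a barrier to the optimizers and the movdqa copies get materialized anyway.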
The pattern-matching and RTL optimizers seem quite hostile to
mismatched-mode operations. If I were starting from scratch, I'd define
a single V128I mode and distinguish paddw and paddq by operation codes,
or possibly by using subreg:SSEMODEI throughout the patterns. Any less
intrusive ideas?
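For what it's worth, the subreg:SSEMODEI variant might look something like the following for the integer adds. This is a hand-written sketch against a hypothetical single V128I container mode (V128I, and the exact operand wrapping, are my invention; "padd<ssevecsize>" is meant in the spirit of the existing mode-attribute tricks in sse.md), not a tested patch:

```
;; Hypothetical: all operands live in one 128-bit container mode,
;; and each pattern projects out the element mode it needs via subregs,
;; so no real mode change ever separates the operations.
(define_insn "add<mode>3"
  [(set (subreg:SSEMODEI (match_operand:V128I 0 "register_operand" "=x") 0)
        (plus:SSEMODEI
          (subreg:SSEMODEI (match_operand:V128I 1 "register_operand" "%0") 0)
          (subreg:SSEMODEI (match_operand:V128I 2 "nonimmediate_operand" "xm") 0)))]
  "TARGET_SSE2"
  "padd<ssevecsize>\t{%2, %0|%0, %2}")
```

Whether combine and the register allocator would tolerate subregs in the set destinations like this is exactly the open question.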
Thanks.
(ISTR some earlier discussion about this but can't find it; apologies if
I'm reopening something that shouldn't be:)