Just to review, the second function here was the problem (compiled with
-march=pentium4 -mtune=prescott -O2 -mfpmath=sse -msse2):
#include <emmintrin.h>  /* SSE2 integer intrinsics: _mm_add_epi16, _mm_add_epi64 */

__m128i foo3(__m128i z, __m128i a, int N) {
  int i;
  for (i = 0; i < N; i++) {
    a = _mm_add_epi64(z, a);
  }
  return _mm_add_epi64(a, a);
}

__m128i foo1(__m128i z, __m128i a, int N) {
  int i;
  for (i = 0; i < N; i++) {
    a = _mm_add_epi16(z, a);
  }
  return _mm_add_epi16(a, a);
}
where the inner loop compiles to

    movdqa  %xmm2, %xmm0
    paddw   %xmm1, %xmm0
    movdqa  %xmm0, %xmm1

instead of a single paddw. The response was that I should look at the
register allocator.
OK. The RTL coming into the register allocator looks like:

    R70:v8hi <- R59:v8hi + subreg:v8hi(R66:v2di)
    R66:v2di <- subreg:v2di(R70:v8hi)

where R70 is used only in these two insns, and R66 is live on entry to
and exit from the loop.
First, local-alloc picks a hard reg (R21) for R70. Global has some code
(copy_preference) that tries to assign R66 to the same hard reg as the
things R66 is copied to; that code doesn't look under subregs, so it
isn't triggered on this RTL. It's straightforward to extend it to look
under subregs, and that works for this example. (Although deciding
exactly which subregs are safe to look under will need more attention
than I've given it, if we want this in.)
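To make the shape of that change concrete, here is a minimal sketch of
the kind of subreg peeling I mean. This is a hypothetical helper, not
the existing global.c code, and the lowpart/size test is only a
stand-in for the "which subregs are safe" question above:

/* Hypothetical helper (not existing GCC code): strip a "safe" subreg
   so that copy_preference can still recognize
     (set (reg:V2DI 66) (subreg:V2DI (reg:V8HI 70) 0))
   as a copy from R70 to R66.  */
static rtx
strip_preference_subreg (rtx x)
{
  /* Only look through lowpart subregs that don't change size; exactly
     which subregs are really safe here needs more thought.  */
  if (GET_CODE (x) == SUBREG
      && subreg_lowpart_p (x)
      && GET_MODE_SIZE (GET_MODE (x))
         == GET_MODE_SIZE (GET_MODE (SUBREG_REG (x))))
    return SUBREG_REG (x);
  return x;
}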
However, that's not the whole problem. When we have two accumulators
in the loop:
#include <emmintrin.h>

__m128i foo1(__m128i z, __m128i a, __m128i b, int N) {
  int i;
  for (i = 0; i < N; i++) {
    a = _mm_add_epi16(z, a);
    b = _mm_add_epi16(z, b);
  }
  return _mm_add_epi16(a, b);
}
the RTL for the loop body looks like:

    R70:v8hi <- R59:v8hi + subreg:v8hi(R66:v2di)
    R66:v2di <- subreg:v2di(R70:v8hi)
    R72:v8hi <- R61:v8hi + subreg:v8hi(R68:v2di)
    R68:v2di <- subreg:v2di(R72:v8hi)
local-alloc assigns the same hard reg (R21) to both R70 and R72. That
means R21 conflicts with both R66 and R68, so it is not considered for
either of them, and the copy_preference optimization isn't invoked. I
don't see a way to fix that in global. Doing round-robin allocation in
local-alloc would alleviate it... for a while, until the block gets big
enough that registers are reused, so that's not a complete solution.
Really I don't think this is an RA problem at all. We ought to be able
to combine these patterns no matter what the RA does. The following
pattern makes combine do it:
(define_insn "*addmixed<mode>3"
  [(set (match_operand:V2DI 0 "register_operand" "=x")
        (subreg:V2DI
          (plus:SSEMODE124
            (match_operand:SSEMODE124 2 "nonimmediate_operand" "xm")
            (subreg:SSEMODE124
              (match_operand:V2DI 1 "nonimmediate_operand" "%0") 0))
          0))]
  "TARGET_SSE2 && ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
  "padd<ssevecsize>\t{%2, %0|%0, %2}"
  [(set_attr "type" "sseiadd")
   (set_attr "mode" "TI")])
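For concreteness, with this pattern combine should be able to collapse
the two insns from the first loop into a single insn of roughly this
shape (same informal notation and register numbers as the dumps above),
which then comes out as the single paddw we want:

    R66:v2di <- subreg:v2di(R59:v8hi + subreg:v8hi(R66:v2di))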
I'm not very happy about this because it's really not an x86 problem
either, at least in theory, but flushing the problem down to the RA
doesn't look profitable. Comments?