Just to review, the second function here (foo1) was the problem:
(-march=pentium4 -mtune=prescott -O2 -mfpmath=sse -msse2)
#include <xmmintrin.h>
__m128i foo3(__m128i z, __m128i a, int N) {
        int i;
        for (i=0; i<N; i++) {
                a = _mm_add_epi64(z, a);
        }
        return _mm_add_epi64(a, a);
}
__m128i foo1(__m128i z, __m128i a, int N) {
        int i;
        for (i=0; i<N; i++) {
                a = _mm_add_epi16(z, a);
        }
        return _mm_add_epi16(a, a);
}


where the inner loop compiles to

        movdqa  %xmm2, %xmm0
        paddw   %xmm1, %xmm0
        movdqa  %xmm0, %xmm1

instead of a single paddw. The response was that I should look at the register allocator.
OK.  The rtl coming in looks like:

R70:v8hi  <-  R59:v8hi + subreg:v8hi (R66:v2di)
R66:v2di <- subreg:v2di(R70:v8hi)

where R70 is used only in these 2 insns, and R66 is live on entry to and exit from the loop. First, local-alloc picks a hard reg (R21) for R70. Global has some code that tries to assign R66 to the same hard regs as things that R66 is copied to (copy_preference); that code doesn't look under subregs, so it isn't triggered by this rtl. It's straightforward to extend this code to look under subregs, and that works for this example. (Just which subregs are safe to look under will require more attention than I've given it, though, if we want this in.)
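To make the "look under subregs" idea concrete, here is a minimal toy sketch. It is not GCC code: toy_rtx, toy_strip_subregs and toy_copy_source_regno are hypothetical names invented for illustration, and the sketch ignores the question of which subregs are actually safe to look under.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for rtl; these are NOT GCC's real rtx types or macros. */
enum toy_code { TOY_REG, TOY_SUBREG, TOY_OTHER };

struct toy_rtx {
    enum toy_code code;
    int regno;                   /* meaningful only for TOY_REG */
    const struct toy_rtx *inner; /* meaningful only for TOY_SUBREG */
};

/* Peel off subreg wrappers and return the expression underneath. */
const struct toy_rtx *toy_strip_subregs(const struct toy_rtx *x)
{
    assert(x != NULL);
    while (x->code == TOY_SUBREG) {
        assert(x->inner != NULL);
        x = x->inner;
    }
    return x;
}

/* Return the register number copied by "dest <- src", or -1 if src is not
   recognized as a register.  With look_under_subregs false, a subreg of a
   register is not recognized, which models the behavior described above. */
int toy_copy_source_regno(const struct toy_rtx *src, bool look_under_subregs)
{
    assert(src != NULL);
    if (look_under_subregs)
        src = toy_strip_subregs(src);
    return src->code == TOY_REG ? src->regno : -1;
}
```

For the copy R66:v2di <- subreg:v2di(R70:v8hi), the source is a TOY_SUBREG wrapping register 70. In this model the copy is missed when look_under_subregs is false and found when it is true.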

However, that's not the whole problem. Take two accumulators in the loop (source first, then the rtl coming in):

#include <xmmintrin.h>

__m128i foo1(__m128i z, __m128i a, __m128i b, int N) {
        int i;
        for (i=0; i<N; i++) {
                a = _mm_add_epi16(z, a);
                b = _mm_add_epi16(z, b);
        }
        return _mm_add_epi16(a,b);
}


R70:v8hi  <-  R59:v8hi + subreg:v8hi (R66:v2di)
R66:v2di <- subreg:v2di(R70:v8hi)
R72:v8hi  <-  R61:v8hi + subreg:v8hi (R68:v2di)
R68:v2di <- subreg:v2di(R72:v8hi)

local-alloc assigns the same reg (R21) to R70 and R72. This means R21 conflicts with both R66 and R68, so it is not considered for either of them, and the copy_preference optimization isn't invoked. I don't see a way to fix that in global. Doing round-robin allocation in local-alloc would alleviate that for a while, until the block gets big enough that registers are reused, so that's not a complete solution.
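The conflict can be modelled with a small toy sketch. It is not GCC code, and all the names are hypothetical. It assumes, from my reading of the rtl above, that R66 is live across the insns where R72 is live and R68 is live across the insns where R70 is live. R22 in the usage note stands for whatever second hard reg a round-robin local-alloc would pick.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A block-local pseudo after local-alloc has given it a hard reg. */
struct toy_local { int hard_reg; };

/* A loop-carried pseudo that global still has to allocate. */
struct toy_global {
    const struct toy_local *copy_partner; /* local it is copied from */
    const struct toy_local *overlapping;  /* local live at the same time */
};

/* The copy partner's hard reg is only a candidate if no overlapping
   local already occupies that same hard reg. */
bool toy_copy_preference_usable(const struct toy_global *g)
{
    assert(g != NULL && g->copy_partner != NULL && g->overlapping != NULL);
    return g->copy_partner->hard_reg != g->overlapping->hard_reg;
}
```

With R70 and R72 both in R21, the check fails for R66 (partner R70, overlapping R72) and for R68 (partner R72, overlapping R70). Moving R72 to R22 makes it succeed for both, which is the round-robin effect.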

Really I don't think this is an RA problem at all. We ought to be able to combine these patterns no matter what the RA does. The following pattern makes combine do it:

(define_insn "*addmixed<mode>3"
  [(set (match_operand:V2DI 0 "register_operand" "=x")
        (subreg:V2DI (plus:SSEMODE124
          (match_operand:SSEMODE124 2 "nonimmediate_operand" "xm")
          (subreg:SSEMODE124 (match_operand:V2DI 1 "nonimmediate_operand" "%0") 0)) 0))]
  "TARGET_SSE2 && ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
  "padd<ssevecsize>\t{%2, %0|%0, %2}"
  [(set_attr "type" "sseiadd")
   (set_attr "mode" "TI")])
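As a sanity check on what the pattern above expresses, here is a portable C model of its V8HI case. It is only a sketch with hypothetical names (toy_v2di, toy_addmixed_v8hi), not how GCC or the hardware implements it. The two memcpy calls play the role of the subregs, which change the mode but not the bits, so the whole thing amounts to one lane-wise 16-bit add.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 128 bits viewed as two 64-bit lanes, standing in for a V2DI register. */
struct toy_v2di { uint64_t q[2]; };

/* Model of: (set V2DI (subreg:V2DI (plus:V8HI z (subreg:V8HI acc)))) */
struct toy_v2di toy_addmixed_v8hi(struct toy_v2di acc, const uint16_t z[8])
{
    uint16_t lanes[8];

    assert(z != NULL);
    memcpy(lanes, acc.q, sizeof lanes);          /* subreg:V8HI of the V2DI input */
    for (int i = 0; i < 8; i++)
        lanes[i] = (uint16_t)(lanes[i] + z[i]);  /* plus:V8HI, each lane wraps */
    memcpy(acc.q, lanes, sizeof lanes);          /* subreg:V2DI of the V8HI sum */
    return acc;
}
```

Adding a vector of eight 1s to an all-zero accumulator leaves 0x0001000100010001 in each 64-bit half. Adding it to an all-ones accumulator wraps every 16-bit lane to zero, with no carry between lanes.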


I'm not very happy about this because it's really not an x86 problem either, at least in theory, but flushing the problem down to the RA doesn't look profitable. Comments?
