Just to review, the second function here was the problem (compiled with
-march=pentium4 -mtune=prescott -O2 -mfpmath=sse -msse2):
#include <emmintrin.h>  /* SSE2 integer intrinsics: _mm_add_epi16, _mm_add_epi64 */

__m128i foo3(__m128i z, __m128i a, int N) {
  int i;
  for (i = 0; i < N; i++) {
    a = _mm_add_epi64(z, a);
  }
  return _mm_add_epi64(a, a);
}

__m128i foo1(__m128i z, __m128i a, int N) {
  int i;
  for (i = 0; i < N; i++) {
    a = _mm_add_epi16(z, a);
  }
  return _mm_add_epi16(a, a);
}
where the inner loop compiles to

    movdqa  %xmm2, %xmm0
    paddw   %xmm1, %xmm0
    movdqa  %xmm0, %xmm1

instead of a single paddw. The response was that I should look at the
register allocator.
OK. The RTL coming into the register allocator looks like:

    R70:v8hi <- R59:v8hi + subreg:v8hi(R66:v2di)
    R66:v2di <- subreg:v2di(R70:v8hi)

where R70 is used only in these two insns, and R66 is live on entry to
and exit from the loop.
First, local-alloc picks a hard reg (R21) for R70. Global has some code
(copy_preference) that tries to assign R66 to the same hard reg as the
things R66 is copied to; that code doesn't look under subregs, so it
isn't triggered on this RTL. It's straightforward to extend it to look
under subregs, and that works for this example. (Although deciding
exactly which subregs are safe to look under will need more attention
than I've given it, if we want this in.)
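To make the shape of that change concrete, here is a minimal sketch of
the kind of subreg peeling I mean. This is a hypothetical helper, not
the existing global.c code, and the lowpart/size test is only a
stand-in for the "which subregs are safe" question above:

/* Hypothetical helper (not existing GCC code): strip a "safe" subreg
   so that copy_preference can still recognize
     (set (reg:V2DI 66) (subreg:V2DI (reg:V8HI 70) 0))
   as a copy from R70 to R66.  */
static rtx
strip_preference_subreg (rtx x)
{
  /* Only look through lowpart subregs that don't change size; exactly
     which subregs are really safe here needs more thought.  */
  if (GET_CODE (x) == SUBREG
      && subreg_lowpart_p (x)
      && GET_MODE_SIZE (GET_MODE (x))
         == GET_MODE_SIZE (GET_MODE (SUBREG_REG (x))))
    return SUBREG_REG (x);
  return x;
}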
However, that's not the whole problem. When we have two accumulators
in the loop:
#include <emmintrin.h>

__m128i foo1(__m128i z, __m128i a, __m128i b, int N) {
  int i;
  for (i = 0; i < N; i++) {
    a = _mm_add_epi16(z, a);
    b = _mm_add_epi16(z, b);
  }
  return _mm_add_epi16(a, b);
}
the RTL for the loop body looks like:

    R70:v8hi <- R59:v8hi + subreg:v8hi(R66:v2di)
    R66:v2di <- subreg:v2di(R70:v8hi)
    R72:v8hi <- R61:v8hi + subreg:v8hi(R68:v2di)
    R68:v2di <- subreg:v2di(R72:v8hi)
local-alloc assigns the same hard reg (R21) to both R70 and R72. That
means R21 conflicts with both R66 and R68, so it is not considered for
either of them, and the copy_preference optimization isn't invoked. I
don't see a way to fix that in global. Doing round-robin allocation in
local-alloc would alleviate it... for a while, until the block gets big
enough that registers are reused, so that's not a complete solution.
Really I don't think this is an RA problem at all. We ought to be able
to combine these patterns no matter what the RA does. The following
pattern makes combine do it:
(define_insn "*addmixed<mode>3"
  [(set (match_operand:V2DI 0 "register_operand" "=x")
        (subreg:V2DI
          (plus:SSEMODE124
            (match_operand:SSEMODE124 2 "nonimmediate_operand" "xm")
            (subreg:SSEMODE124
              (match_operand:V2DI 1 "nonimmediate_operand" "%0") 0))
          0))]
  "TARGET_SSE2 && ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
  "padd<ssevecsize>\t{%2, %0|%0, %2}"
  [(set_attr "type" "sseiadd")
   (set_attr "mode" "TI")])
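For concreteness, with this pattern combine should be able to collapse
the two insns from the first loop into a single insn of roughly this
shape (same informal notation and register numbers as the dumps above),
which then comes out as the single paddw we want:

    R66:v2di <- subreg:v2di(R59:v8hi + subreg:v8hi(R66:v2di))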
I'm not very happy about this because it's really not an x86 problem
either, at least in theory, but flushing the problem down to the RA
doesn't look profitable. Comments?