Re: RFA: pervasive SSE codegen inefficiency

Dale Johannesen Thu, 15 Sep 2005 11:07:27 -0700


On Sep 14, 2005, at 9:50 PM, Andrew Pinski wrote:

On Sep 14, 2005, at 9:21 PM, Dale Johannesen wrote:

Consider the following SSE code
(-march=pentium4 -mtune=prescott -O2 -mfpmath=sse -msse2)
<4256776a.c>


The first inner loop compiles to

        paddq   %xmm0, %xmm1

Good.  The second compiles to

        movdqa  %xmm2, %xmm0
        paddw   %xmm1, %xmm0
        movdqa  %xmm0, %xmm1

when it could be using a single paddw.  The basic problem is that
our approach defines __m128i to be V2DI even though all the operations
on the object are V4SI, so there are a lot of subreg's that don't need
to generate code. I'd like to fix this, but am not sure how to go about it.


From the looks of this, it looks more like a register allocation issue and
has nothing to do with subregs at all, except for the subregs being there.

That's kind of an overstatement; obviously getting rid of the subregs would solve the problem, as you can see from the first function. I think you're right that

If we allocated 64 and 63 as the same register, it would have worked correctly.

(you mean 64 and 66) would fix this example; I'll look at that. Having a more uniform representation for operations on __m128i objects would simplify things all over the place, though.
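
For readers without the attachment, here is a minimal sketch of where the mode mismatch comes from. It is not the attached test case. It uses GCC's generic vector extensions, and the names v2di, v8hu, add_epi64 and add_epi16 are hypothetical stand-ins chosen for illustration. The assumption is that the vector type is declared with 64-bit elements (as __m128i is), so a 64-bit add needs no reinterpretation while a 16-bit add has to cast in and out of a different element type. Those casts are no-ops at the machine level, but they are what shows up as subregs in the RTL.

```c
/* Hypothetical stand-ins for the header types; not the real emmintrin.h names. */
typedef long long      v2di __attribute__((vector_size(16))); /* two 64-bit lanes, like __m128i */
typedef unsigned short v8hu __attribute__((vector_size(16))); /* eight 16-bit lanes */

/* 64-bit lane add: the operands already have the declared element type,
   so no reinterpreting cast is involved. */
static v2di add_epi64(v2di a, v2di b)
{
    return a + b;
}

/* 16-bit lane add: the same 128 bits are reinterpreted as eight 16-bit
   lanes, added lane by lane, and the result is cast back to the declared
   type.  Unsigned lanes are used so that wraparound is well defined. */
static v2di add_epi16(v2di a, v2di b)
{
    return (v2di)((v8hu)a + (v8hu)b);
}
```

The two functions differ only in the casts, yet the results differ whenever a 16-bit lane overflows: adding 1 to 0xFFFF carries into bit 16 in add_epi64 and wraps to 0 in add_epi16.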
