On Sep 14, 2005, at 9:50 PM, Andrew Pinski wrote:
> On Sep 14, 2005, at 9:21 PM, Dale Johannesen wrote:
>> Consider the following SSE code
>> (-march=pentium4 -mtune=prescott -O2 -mfpmath=sse -msse2)
>> <4256776a.c>
>> The first inner loop compiles to
>>         paddq   %xmm0, %xmm1
>> Good.  The second compiles to
>>         movdqa  %xmm2, %xmm0
>>         paddw   %xmm1, %xmm0
>>         movdqa  %xmm0, %xmm1
>> when it could be using a single paddw.  The basic problem is that
>> our approach defines __m128i to be V2DI even though all the operations
>> on the object are V4SI, so there are a lot of subreg's that don't need
>> to generate code.  I'd like to fix this, but am not sure how to go
>> about it.
> From the looks of it, this looks more like a register allocation issue and
> has nothing to do with subregs at all, except for the subregs being there.
That's kind of an overstatement; obviously getting rid of the subregs would
solve the problem, as you can see from the first function.  I think you're
right that
> If we allocated 64 and 63 as the same register, it would have worked
> correctly.
(you mean 64 and 66) would fix this example; I'll look at that.  Having a more
uniform representation for operations on __m128i objects would simplify things
all over the place, though.
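
For anyone reading without the attachment, here is a rough sketch of the kind
of source that shows the mismatch.  It is not the contents of 4256776a.c; the
type and function names (v2di, v8hi, add_epi64, add_epi16, sum_epi64,
sum_epi16) are made up for illustration, and it uses GCC's generic vector
extension rather than the real intrinsics.  The point is only that the 64-bit
add works directly in the declared type, while the 16-bit add has to
reinterpret both operands and the result, and those reinterpretations are what
I'd expect to show up as subregs.

```c
/* Hypothetical stand-ins, using the GCC/Clang vector extension. */
typedef long long v2di __attribute__((vector_size(16))); /* declared type, like __m128i */
typedef short     v8hi __attribute__((vector_size(16))); /* lanes that paddw works on   */

/* 64-bit lane add: operands are already in the declared type, no casts. */
static v2di add_epi64(v2di a, v2di b)
{
    return a + b;
}

/* 16-bit lane add: reinterpret both inputs as eight shorts, add, and
   reinterpret the result back.  The casts only change the type, not the bits. */
static v2di add_epi16(v2di a, v2di b)
{
    return (v2di)((v8hi)a + (v8hi)b);
}

/* Accumulating loops, loosely analogous to the two inner loops discussed. */
static v2di sum_epi64(const v2di *p, int n)
{
    v2di acc = {0, 0};
    for (int i = 0; i < n; i++)
        acc = add_epi64(acc, p[i]);
    return acc;
}

static v2di sum_epi16(const v2di *p, int n)
{
    v2di acc = {0, 0};
    for (int i = 0; i < n; i++)
        acc = add_epi16(acc, p[i]);
    return acc;
}
```

Both loops carry a single accumulator, so ideally each body is one add
instruction; whether the second one also gets the extra moves depends on the
compiler version and how the casts are lowered.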