On 09/28/2011 05:59 AM, Artem Shinkarov wrote:
> I don't really understand this. As far as I know, expand_normal
> "converts" a tree to an rtx. All my computations happen at the rtx
> level, and force_reg is needed only to bring an rtx expression into a
> register of the correct mode. If I am missing something, could you
> give an example of how to use expand_normal instead of force_reg in
> this particular code?

Sorry, I meant expand_(simple_)binop.

>> Is ssse3_pshufb why you do the wrong thing in the expander for v0 != v1?
> 
> My feeling is that in the v0 != v1 case it may be more efficient to
> perform piecewise shuffling rather than bitwise dances around the
> masks.

Maybe for V2DI and V2DF modes, but probably not otherwise.

We can perform the double-word shuffle in 12 insns; 10 for SSE 4.1.
Example assembly attached.

>> It's certainly possible to handle it, though it takes a few more steps,
>> and might well be more efficient as a libgcc function rather than inline.
> 
> I don't really understand why it could be more efficient. I thought
> that inlining gives the final RTL optimisation more opportunities.

We'll not be able to optimize this at the rtl level.  There are too many
UNSPEC instructions in the way.  In any case, even if that weren't so, we'd
only be able to do useful optimization for a constant permutation, and
we should have been able to prove that at the gimple level.


r~
        .data
        .align 16
vec3:   .long   3,3,3,3
vec4:   .long   4,4,4,4
dup4:   .byte   0,0,0,0, 4,4,4,4, 8,8,8,8, 12,12,12,12
ofs4:   .byte   0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3

        .text
shuffle2:

        // Convert the low bits of the mask to a byte shuffle control
        movdqa  %xmm2, %xmm3
        pand    vec3, %xmm3     // index within the selected vector
        pmulld  vec4, %xmm3     // scale to a byte offset (pmulld is
                                //   SSE4.1; pslld $2 works before that)
        pshufb  dup4, %xmm3     // broadcast offset to all 4 lane bytes
        paddb   ofs4, %xmm3     // add byte position within the element

        // Shuffle both inputs with the same byte control
        pshufb  %xmm3, %xmm0
        pshufb  %xmm3, %xmm1

        // Select and merge the inputs on bit 2 of the original mask
        // Use ix86_expand_int_vcond for use of pblendvb for SSE4_1.
        pand    vec4, %xmm2     // isolate the vector-select bit
        pcmpeqd vec4, %xmm2     // all-ones where xmm1 is selected
        pand    %xmm2, %xmm1    // keep the selected xmm1 elements
        pandn   %xmm0, %xmm2    // keep the unselected xmm0 elements
        por     %xmm1, %xmm2    // merge
        movdqa  %xmm2, %xmm0    // return value in %xmm0

        ret
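For reference, here is a scalar C model of the routine above (the
function name and the model itself are mine, not part of the patch;
it assumes the little-endian byte layout x86 uses).  Stage 1 mirrors
the pand/pmulld control construction, stage 2 the pshufbs and the
pand/pandn/por select:

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of shuffle2: each 32-bit mask element selects one of
   the 8 words of the two inputs.  Only the low 3 bits of each index
   matter, matching the pand against 3 and 4 in the assembly.  */
static void
shuffle2_model (uint32_t dst[4], const uint32_t v0[4],
                const uint32_t v1[4], const uint32_t mask[4])
{
  const uint8_t *b0 = (const uint8_t *) v0;
  const uint8_t *b1 = (const uint8_t *) v1;
  uint8_t out[16];

  for (int i = 0; i < 4; i++)
    {
      uint32_t base = (mask[i] & 3) * 4;  /* pand vec3; pmulld vec4 */
      int from_v1 = (mask[i] & 4) != 0;   /* pand/pcmpeqd vec4 */
      for (int j = 0; j < 4; j++)         /* pshufb dup4; paddb ofs4 */
        out[4 * i + j] = from_v1 ? b1[base + j] : b0[base + j];
    }
  memcpy (dst, out, 16);
}
```

A variable-mask shuffle like this is what the expander has to emit
when nothing about the indices is known at compile time.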
