On 09/28/2011 05:59 AM, Artem Shinkarov wrote:
> I don't really understand this. As far as I know, expand_normal
> "converts" tree to rtx. All my computations are happening at the level
> of rtx and force_reg is needed just to bring an rtx expression to the
> register of the correct mode. If I am missing something, could you
> give an example how can I use expand_normal instead of force_reg in
> this particular code.

Sorry, I meant expand_(simple_)binop.

>> Is ssse3_pshufb why you do the wrong thing in the expander for v0 != v1?
>
> My personal feeling is that it may be the case with v0 != v1, that it
> would be more efficient to perform piecewise shuffling rather than
> bitwise dances around the masks.

Maybe for V2DI and V2DFmode, but probably not otherwise.

We can perform the double-word shuffle in 12 insns; 10 for SSE 4.1.
Example assembly attached.

>> It's certainly possible to handle it, though it takes a few more steps,
>> and might well be more efficient as a libgcc function rather than inline.
>
> I don't really understand why it could be more efficient. I thought
> that inline gives more chances to the final RTL optimisation.

We'll not be able to optimize this at the rtl level.  There are too many
UNSPEC instructions in the way.  In any case, even if that weren't so we'd
only be able to do useful optimization for a constant permutation.  And we
should have been able to prove that at the gimple level.


r~

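	.data
	.align 16
vec3:	.long	3,3,3,3
vec4:	.long	4,4,4,4
zero:	.long	0,0,0,0
dup4:	.byte	0,0,0,0, 4,4,4,4, 8,8,8,8, 12,12,12,12
ofs4:	.byte	0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3

	.text
shuffle2:
	// Inputs: v0 in %xmm0, v1 in %xmm1, mask in %xmm2.

	// Convert the low bits of the mask to a shuffle
	movdqa	%xmm2, %xmm3
	pand	vec3, %xmm3		// element index 0..3
	pmulld	vec4, %xmm3		// byte offset 0, 4, 8, 12
	pshufb	dup4, %xmm3		// replicate the offset into all 4 bytes
	paddb	ofs4, %xmm3		// add the byte position within the element

	// Shuffle both inputs
	pshufb	%xmm3, %xmm0
	pshufb	%xmm3, %xmm1

	// Select and merge the inputs
	// Use ix86_expand_int_vcond for use of pblendvb for SSE4_1.
	pand	vec4, %xmm2
	pcmpeqd	zero, %xmm2		// all-ones where bit 2 is clear (take v0)
	pand	%xmm2, %xmm0		// keep the v0 elements
	pandn	%xmm1, %xmm2		// pandn is ~dst & src: keep the v1 elements
	por	%xmm2, %xmm0
	ret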
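For anyone who wants to check the sequence without SSSE3 hardware at hand, here is a minimal scalar sketch in C of what the attachment is meant to compute. It is only a model under stated assumptions: the names `v4si`, `shuffle2_ref`, `shuffle2_steps`, `pshufb_model` and `shuffle2_selfcheck` are made up for illustration and are not GCC functions, the vector is assumed to hold four 32-bit elements, and bytes are laid out little-endian as on x86. `shuffle2_ref` states the intended result directly (index bits 0-1 pick the element, bit 2 picks the input), while `shuffle2_steps` walks through the same steps as the assembly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit vector of four 32-bit elements. */
typedef struct { uint32_t lane[4]; } v4si;

/* Intended result: bits 0-1 of each index pick the element,
   bit 2 picks the input vector (0 = v0, 1 = v1). */
static v4si shuffle2_ref(v4si v0, v4si v1, v4si mask)
{
    v4si r;
    for (int i = 0; i < 4; i++) {
        uint32_t idx = mask.lane[i];
        r.lane[i] = (idx & 4) ? v1.lane[idx & 3] : v0.lane[idx & 3];
    }
    return r;
}

/* Explicit little-endian byte layout, so the model does not depend on the host. */
static void to_bytes(uint8_t b[16], v4si v)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            b[4 * i + j] = (uint8_t)(v.lane[i] >> (8 * j));
}

static v4si from_bytes(const uint8_t b[16])
{
    v4si v;
    for (int i = 0; i < 4; i++) {
        v.lane[i] = 0;
        for (int j = 0; j < 4; j++)
            v.lane[i] |= (uint32_t)b[4 * i + j] << (8 * j);
    }
    return v;
}

/* Scalar model of pshufb: result byte i is src[ctl[i] & 15],
   or zero when bit 7 of the control byte is set. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t ctl[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (ctl[i] & 0x80) ? 0 : src[ctl[i] & 15];
    memcpy(dst, tmp, 16);
}

/* Same steps as the assembly, one scalar loop per instruction group. */
static v4si shuffle2_steps(v4si v0, v4si v1, v4si mask)
{
    static const uint8_t dup4[16] = {0,0,0,0, 4,4,4,4, 8,8,8,8, 12,12,12,12};
    static const uint8_t ofs4[16] = {0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3};
    uint8_t ctl[16], b0[16], b1[16];
    v4si t, r;

    /* pand vec3, pmulld vec4: byte offset of the chosen element */
    for (int i = 0; i < 4; i++)
        t.lane[i] = (mask.lane[i] & 3) * 4;

    /* pshufb dup4, paddb ofs4: build the byte-level shuffle control */
    to_bytes(ctl, t);
    pshufb_model(ctl, ctl, dup4);
    for (int i = 0; i < 16; i++)
        ctl[i] = (uint8_t)(ctl[i] + ofs4[i]);

    /* pshufb on both inputs */
    to_bytes(b0, v0);
    to_bytes(b1, v1);
    pshufb_model(b0, b0, ctl);
    pshufb_model(b1, b1, ctl);
    v0 = from_bytes(b0);
    v1 = from_bytes(b1);

    /* select and merge on bit 2 of each index */
    for (int i = 0; i < 4; i++) {
        uint32_t sel = (mask.lane[i] & 4) ? 0xFFFFFFFFu : 0u;
        r.lane[i] = (v1.lane[i] & sel) | (v0.lane[i] & ~sel);
    }
    return r;
}

/* Compare both versions over every index vector with entries 0..7. */
static void shuffle2_selfcheck(void)
{
    v4si a = {{10, 11, 12, 13}}, b = {{20, 21, 22, 23}}, m;
    for (unsigned n = 0; n < 4096; n++) {
        for (int i = 0; i < 4; i++)
            m.lane[i] = (n >> (3 * i)) & 7;
        v4si x = shuffle2_ref(a, b, m), y = shuffle2_steps(a, b, m);
        for (int i = 0; i < 4; i++)
            assert(x.lane[i] == y.lane[i]);
    }
}
```

Calling `shuffle2_selfcheck()` compares the two versions over all 4096 index vectors with entries in 0..7. This says nothing about instruction count or speed; it only checks that the mask arithmetic (and, multiply by 4, replicate, add byte offsets) yields the right byte-level control for pshufb.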