On Sun, Oct 24, 2021 at 6:34 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > This patch provides RTL expanders to implement logical shifts and > rotates of 128-bit values (stored in vector integer registers) by > constant bit counts. Previously, GCC would transfer these values > to a pair of scalar registers (TImode) via memory to perform the > operation, then transfer the result back via memory. Instead these > operations are now expanded using (between 1 and 5) SSE2 vector > instructions.
Hm, instead of using memory (without STL forwarding for general -> XMM moves!) these should use something similar to what clang produces (or use pextrq/pinsrq, at least with SSE4.1): movq %xmm0, %rax pshufd $78, %xmm0, %xmm0 movq %xmm0, %rcx shldq $8, %rax, %rcx shlq $8, %rax movq %rcx, %xmm1 movq %rax, %xmm0 punpcklqdq %xmm1, %xmm0 > Logical shifts by multiples of 8 can be implemented using x86_64's > pslldq/psrldq instruction: > ashl_8: pslldq $1, %xmm0 > ret > lshr_32: > psrldq $4, %xmm0 > ret > > Logical shifts by greater than 64 can use pslldq/psrldq $8, followed > by a psllq/psrlq for the remaining bits: > ashl_111: > pslldq $8, %xmm0 > psllq $47, %xmm0 > ret > lshr_127: > psrldq $8, %xmm0 > psrlq $63, %xmm0 > ret > > The remaining logical shifts make use of the following idiom: > ashl_1: > movdqa %xmm0, %xmm1 > psllq $1, %xmm0 > pslldq $8, %xmm1 > psrlq $63, %xmm1 > por %xmm1, %xmm0 > ret > lshr_15: > movdqa %xmm0, %xmm1 > psrlq $15, %xmm0 > psrldq $8, %xmm1 > psllq $49, %xmm1 > por %xmm1, %xmm0 > ret > > Rotates by multiples of 32 can use x86_64's pshufd: > rotr_32: > pshufd $57, %xmm0, %xmm0 > ret > rotr_64: > pshufd $78, %xmm0, %xmm0 > ret > rotr_96: > pshufd $147, %xmm0, %xmm0 > ret > > Rotates by multiples of 8 (other than multiples of 32) can make > use of both pslldq and psrldq, followed by por: > rotr_8: > movdqa %xmm0, %xmm1 > psrldq $1, %xmm0 > pslldq $15, %xmm1 > por %xmm1, %xmm0 > ret > rotr_112: > movdqa %xmm0, %xmm1 > psrldq $14, %xmm0 > pslldq $2, %xmm1 > por %xmm1, %xmm0 > ret > > And the remaining rotates use one or two pshufd, followed by a > psrld/pslld/por sequence: > rotr_1: > movdqa %xmm0, %xmm1 > pshufd $57, %xmm0, %xmm0 > psrld $1, %xmm1 > pslld $31, %xmm0 > por %xmm1, %xmm0 > ret > rotr_63: > pshufd $78, %xmm0, %xmm1 > pshufd $57, %xmm0, %xmm0 > pslld $1, %xmm1 > psrld $31, %xmm0 > por %xmm1, %xmm0 > ret > rotr_111: > pshufd $147, %xmm0, %xmm1 > pslld $17, %xmm0 > psrld $15, %xmm1 > por %xmm1, %xmm0 > ret > > The new test case, sse2-v1ti-shift.c, is a run-time check to confirm that > the results of V1TImode shifts/rotates by constants, exactly match the > expected results of TImode operations, for various input test vectors. Is the sequence of 4+ SSE instructions really faster than pinsrq/pextrq (and two movq insn) + two operations on integer registers? Uros. > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check with no new failures. Ok for mainline? > > > 2021-10-24 Roger Sayle <ro...@nextmovesoftware.com> > > gcc/ChangeLog > * config/i386/i386-expand.c (ix86_expand_v1ti_shift): New helper > function to expand V1TI mode logical shifts by integer constants. > (ix86_expand_v1ti_rotate): New helper function to expand V1TI > mode rotations by integer constants. > * config/i386/i386-protos.h (ix86_expand_v1ti_shift, > ix86_expand_v1ti_rotate): Prototype new functions here. > * config/i386/sse.md (ashlv1ti3, lshrv1ti3, rotlv1ti3, rotrv1ti3): > New TARGET_SSE2 expanders to implement V1TI shifts and rotations. > > gcc/testsuite/ChangeLog > * gcc.target/i386/sse2-v1ti-shift.c: New test case. > > > Thanks in advance, > Roger > -- >