On Sun, Oct 24, 2021 at 6:34 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> This patch provides RTL expanders to implement logical shifts and
> rotates of 128-bit values (stored in vector integer registers) by
> constant bit counts.  Previously, GCC would transfer these values
> to a pair of scalar registers (TImode) via memory to perform the
> operation, then transfer the result back via memory.  Instead these
> operations are now expanded using (between 1 and 5) SSE2 vector
> instructions.

Hm, instead of going through memory (with no store-to-load forwarding
for general -> XMM moves!), these should use something similar to what
clang produces (or use pextrq/pinsrq, at least with SSE4.1):

       movq    %xmm0, %rax
       pshufd  $78, %xmm0, %xmm0
       movq    %xmm0, %rcx
       shldq   $8, %rax, %rcx
       shlq    $8, %rax
       movq    %rcx, %xmm1
       movq    %rax, %xmm0
       punpcklqdq      %xmm1, %xmm0

> Logical shifts by multiples of 8 can be implemented using x86_64's
> pslldq/psrldq instruction:
> ashl_8:
>         pslldq  $1, %xmm0
>         ret
> lshr_32:
>         psrldq  $4, %xmm0
>         ret
>
> Logical shifts by greater than 64 can use pslldq/psrldq $8, followed
> by a psllq/psrlq for the remaining bits:
> ashl_111:
>         pslldq  $8, %xmm0
>         psllq   $47, %xmm0
>         ret
> lshr_127:
>         psrldq  $8, %xmm0
>         psrlq   $63, %xmm0
>         ret
>
> The remaining logical shifts make use of the following idiom:
> ashl_1:
>         movdqa  %xmm0, %xmm1
>         psllq   $1, %xmm0
>         pslldq  $8, %xmm1
>         psrlq   $63, %xmm1
>         por     %xmm1, %xmm0
>         ret
> lshr_15:
>         movdqa  %xmm0, %xmm1
>         psrlq   $15, %xmm0
>         psrldq  $8, %xmm1
>         psllq   $49, %xmm1
>         por     %xmm1, %xmm0
>         ret
>
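
To make the idiom quoted above easier to check, here is a minimal C sketch that models an XMM register as two 64-bit lanes. The type `v2di` and the helper names are hypothetical, chosen only to mirror the instruction mnemonics; this is not code from the patch. It assumes a shift count strictly between 0 and 64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an XMM register as two 64-bit lanes. */
typedef struct { uint64_t lo, hi; } v2di;

/* psllq/psrlq: shift each 64-bit lane independently. */
v2di psllq(v2di x, unsigned n) { return (v2di){ x.lo << n, x.hi << n }; }
v2di psrlq(v2di x, unsigned n) { return (v2di){ x.lo >> n, x.hi >> n }; }

/* pslldq $8 / psrldq $8: move one lane into the other, zero-filling. */
v2di pslldq8(v2di x) { return (v2di){ 0, x.lo }; }
v2di psrldq8(v2di x) { return (v2di){ x.hi, 0 }; }

/* por: bitwise OR of both lanes. */
v2di por(v2di a, v2di b) { return (v2di){ a.lo | b.lo, a.hi | b.hi }; }

/* 128-bit logical shift left by n, following the ashl_1 sequence. */
v2di ashl_v1ti(v2di x, unsigned n)
{
    assert(n > 0 && n < 64);
    /* Bits that cross from the low lane into the high lane. */
    v2di carry = psrlq(pslldq8(x), 64 - n);
    return por(psllq(x, n), carry);
}

/* 128-bit logical shift right by n, following the lshr_15 sequence. */
v2di lshr_v1ti(v2di x, unsigned n)
{
    assert(n > 0 && n < 64);
    /* Bits that cross from the high lane into the low lane. */
    v2di carry = psllq(psrldq8(x), 64 - n);
    return por(psrlq(x, n), carry);
}
```

The per-lane shifts drop the bits that should cross the 64-bit boundary, so the second copy exists only to recover those bits and OR them back in.
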
> Rotates by multiples of 32 can use x86_64's pshufd:
> rotr_32:
>         pshufd  $57, %xmm0, %xmm0
>         ret
> rotr_64:
>         pshufd  $78, %xmm0, %xmm0
>         ret
> rotr_96:
>         pshufd  $147, %xmm0, %xmm0
>         ret
>
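
For reference, the three pshufd immediates quoted above can be derived mechanically. The following C sketch models pshufd on four 32-bit lanes (lane 0 least significant) and computes the immediate for a right rotation by a whole number of lanes. The names `v4si`, `pshufd_model` and `rotr32_imm` are hypothetical and only serve this illustration.

```c
#include <stdint.h>

/* Hypothetical model of an XMM register as four 32-bit lanes;
   lane[0] is the least significant. */
typedef struct { uint32_t lane[4]; } v4si;

/* pshufd: destination lane i takes the source lane selected by
   bits [2i+1:2i] of the immediate. */
v4si pshufd_model(v4si src, unsigned imm8)
{
    v4si dst;
    for (unsigned i = 0; i < 4; i++)
        dst.lane[i] = src.lane[(imm8 >> (2 * i)) & 3];
    return dst;
}

/* Immediate that rotates right by k lanes (k * 32 bits):
   destination lane i must read source lane (i + k) mod 4. */
unsigned rotr32_imm(unsigned k)
{
    unsigned imm = 0;
    for (unsigned i = 0; i < 4; i++)
        imm |= ((i + k) & 3) << (2 * i);
    return imm;
}
```

With this model, `rotr32_imm(1)`, `rotr32_imm(2)` and `rotr32_imm(3)` evaluate to 57, 78 and 147, matching rotr_32, rotr_64 and rotr_96.
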
> Rotates by multiples of 8 (other than multiples of 32) can make
> use of both pslldq and psrldq, followed by por:
> rotr_8:
>         movdqa  %xmm0, %xmm1
>         psrldq  $1, %xmm0
>         pslldq  $15, %xmm1
>         por     %xmm1, %xmm0
>         ret
> rotr_112:
>         movdqa  %xmm0, %xmm1
>         psrldq  $14, %xmm0
>         pslldq  $2, %xmm1
>         por     %xmm1, %xmm0
>         ret
>
> And the remaining rotates use one or two pshufd, followed by a
> psrld/pslld/por sequence:
> rotr_1:
>         movdqa  %xmm0, %xmm1
>         pshufd  $57, %xmm0, %xmm0
>         psrld   $1, %xmm1
>         pslld   $31, %xmm0
>         por     %xmm1, %xmm0
>         ret
> rotr_63:
>         pshufd  $78, %xmm0, %xmm1
>         pshufd  $57, %xmm0, %xmm0
>         pslld   $1, %xmm1
>         psrld   $31, %xmm0
>         por     %xmm1, %xmm0
>         ret
> rotr_111:
>         pshufd  $147, %xmm0, %xmm1
>         pslld   $17, %xmm0
>         psrld   $15, %xmm1
>         por     %xmm1, %xmm0
>         ret
>
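
The rotr_1, rotr_63 and rotr_111 sequences all appear to follow one pattern: each 32-bit result lane combines two neighbouring source lanes, one shifted right and one shifted left. A minimal C sketch of that pattern follows, assuming a rotate count that is not a multiple of 32. The type `v4si` and the function `rotr_v1ti` are hypothetical names for illustration, not the expander from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an XMM register as four 32-bit lanes;
   lane[0] is the least significant. */
typedef struct { uint32_t lane[4]; } v4si;

/* 128-bit rotate right by n bits, n not a multiple of 32. */
v4si rotr_v1ti(v4si x, unsigned n)
{
    assert(n < 128 && n % 32 != 0);
    unsigned q = n / 32;   /* whole lanes: selects the pshufd immediates */
    unsigned r = n % 32;   /* remaining bits: the psrld/pslld counts     */
    v4si res;
    for (unsigned i = 0; i < 4; i++) {
        /* Lanes rotated by q, then shifted right (psrld $r). */
        uint32_t low = x.lane[(i + q) & 3] >> r;
        /* Lanes rotated by q + 1, then shifted left (pslld $(32 - r)). */
        uint32_t high = x.lane[(i + q + 1) & 3] << (32 - r);
        res.lane[i] = low | high;   /* por */
    }
    return res;
}
```

When q or q + 1 is a multiple of 4, the corresponding shuffle is the identity, which is why rotr_1 and rotr_111 need only one pshufd while rotr_63 needs two.
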
> The new test case, sse2-v1ti-shift.c, is a run-time check to confirm that
> the results of V1TImode shifts/rotates by constants exactly match the
> expected results of TImode operations, for various input test vectors.

Is the sequence of 4+ SSE instructions really faster than
pinsrq/pextrq (and two movq insns) + two operations on integer
registers?

Uros.

> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check with no new failures.  Ok for mainline?
>
>
> 2021-10-24  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-expand.c (ix86_expand_v1ti_shift): New helper
>         function to expand V1TI mode logical shifts by integer constants.
>         (ix86_expand_v1ti_rotate): New helper function to expand V1TI
>         mode rotations by integer constants.
>         * config/i386/i386-protos.h (ix86_expand_v1ti_shift,
>         ix86_expand_v1ti_rotate): Prototype new functions here.
>         * config/i386/sse.md (ashlv1ti3, lshrv1ti3, rotlv1ti3, rotrv1ti3):
>         New TARGET_SSE2 expanders to implement V1TI shifts and rotations.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/sse2-v1ti-shift.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>
