On Thu, Jan 6, 2022 at 7:00 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
> >
> > This patch improves the code generated when moving a 128-bit value
> > in TImode, represented by two 64-bit registers, to V1TImode, which
> > is a single SSE register.
> >
> > Currently, the simple move:
> >
> > typedef unsigned __int128 uv1ti __attribute__ ((__vector_size__ (16)));
> > uv1ti foo(__int128 x) { return (uv1ti)x; }
> >
> > is always transferred via memory, as:
> >
> > foo:    movq    %rdi, -24(%rsp)
> >         movq    %rsi, -16(%rsp)
> >         movdqa  -24(%rsp), %xmm0
> >         ret
> >
> > With this patch, we now generate (with -msse2):
> >
> > foo:    movq    %rdi, %xmm1
> >         movq    %rsi, %xmm2
> >         punpcklqdq %xmm2, %xmm1
> >         movdqa  %xmm1, %xmm0
> >         ret
> >
> > and with -mavx2:
> >
> > foo:    vmovq   %rdi, %xmm1
> >         vpinsrq $1, %rsi, %xmm1, %xmm0
> >         ret
> >
> > Even more dramatic is the improvement of zero-extended transfers.
> >
> > uv1ti bar(unsigned char c) { return (uv1ti)(__int128)c; }
> >
> > Previously generated:
> >
> > bar:    movq    $0, -16(%rsp)
> >         movzbl  %dil, %eax
> >         movq    %rax, -24(%rsp)
> >         vmovdqa -24(%rsp), %xmm0
> >         ret
> >
> > Now generates:
> >
> > bar:    movzbl  %dil, %edi
> >         movq    %rdi, %xmm0
> >         ret
> >
> > My first attempt at this functionality used a simple define_split:
> >
> > +;; Move TImode to V1TImode via V2DImode instead of memory.
> > +(define_split
> > +  [(set (match_operand:V1TI 0 "register_operand")
> > +        (subreg:V1TI (match_operand:TI 1 "register_operand") 0))]
> > +  "TARGET_64BIT && TARGET_SSE2 && can_create_pseudo_p ()"
> > +  [(set (match_dup 2) (vec_concat:V2DI (match_dup 3) (match_dup 4)))
> > +   (set (match_dup 0) (subreg:V1TI (match_dup 2) 0))]
> > +{
> > +  operands[2] = gen_reg_rtx (V2DImode);
> > +  operands[3] = gen_lowpart (DImode, operands[1]);
> > +  operands[4] = gen_highpart (DImode, operands[1]);
> > +})
> > +
> >
> > Unfortunately, this triggers very late during the compilation,
> > preventing some of the simplifications we'd like (in combine).
> > For example, the foo case above becomes:
> >
> > foo:    movq    %rsi, -16(%rsp)
> >         movq    %rdi, %xmm0
> >         movhps  -16(%rsp), %xmm0
> >
> > transferring half directly, and the other half via memory.
> > And for the bar case above, GCC fails to appreciate that
> > movq/vmovq clears the high bits, resulting in:
> >
> > bar:    movzbl  %dil, %eax
> >         xorl    %edx, %edx
> >         vmovq   %rax, %xmm1
> >         vpinsrq $1, %rdx, %xmm1, %xmm0
> >         ret
> >
> > Hence the solution (i.e. this patch) is to add a special case
> > to ix86_expand_vector_move for TImode to V1TImode transfers.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check with no new failures.  Ok for mainline?
> >
> > 2022-01-06  Roger Sayle  <ro...@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386-expand.c (ix86_expand_vector_move): Add
> >         special case for TImode to V1TImode moves, going via V2DImode.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/i386/sse2-v1ti-mov-1.c: New test case.
> >         * gcc.target/i386/sse2-v1ti-zext.c: New test case.
OK. Thanks, Uros.