On Wed, Apr 20, 2022 at 8:28 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> Doh! ENOPATCH.
>
> > -----Original Message-----
> > From: Roger Sayle <ro...@nextmovesoftware.com>
> > Sent: 20 April 2022 18:50
> > To: 'gcc-patches@gcc.gnu.org' <gcc-patches@gcc.gnu.org>
> > Subject: [x86 PATCH] Improved V1TI (and V2DI) mode equality/inequality.
> >
> >
> > This patch (for when the compiler returns to stage 1) improves support for
> > vector equality and inequality of V1TImode vectors, and V2DImode vectors
> with
> > sse2 but not sse4.  Consider the three functions below:
> >
> > typedef unsigned int uv4si __attribute__ ((__vector_size__ (16))); typedef
> > unsigned long long uv2di __attribute__ ((__vector_size__ (16))); typedef
> > unsigned __int128 uv1ti __attribute__ ((__vector_size__ (16)));
> >
> > uv4si eq_v4si(uv4si x, uv4si y) { return x == y; } uv2di eq_v2di(uv2di x,
> uv2di y) {
> > return x == y; } uv1ti eq_v1ti(uv1ti x, uv1ti y) { return x == y; }
> >
> > These all perform vector comparisons of 128bit SSE2 registers, generating
> the
> > result as a vector, where ~0 (all 1 bits) represents true and a zero
> represents
> > false.  eq_v4si is trivially implemented by x86_64's pcmpeqd instruction.
> This
> > patch improves the other two cases:
> >
> > For v2di, gcc -O2 currently generates:
> >
> >         movq    %xmm0, %rdx
> >         movq    %xmm1, %rax
> >         movdqa  %xmm0, %xmm2
> >         cmpq    %rax, %rdx
> >         movhlps %xmm2, %xmm3
> >         movhlps %xmm1, %xmm4
> >         sete    %al
> >         movq    %xmm3, %rdx
> >         movzbl  %al, %eax
> >         negq    %rax
> >         movq    %rax, %xmm0
> >         movq    %xmm4, %rax
> >         cmpq    %rax, %rdx
> >         sete    %al
> >         movzbl  %al, %eax
> >         negq    %rax
> >         movq    %rax, %xmm5
> >         punpcklqdq      %xmm5, %xmm0
> >         ret
> >
> > but with this patch we now generate:
> >
> >         pcmpeqd %xmm0, %xmm1
> >         pshufd  $177, %xmm1, %xmm0
> >         pand    %xmm1, %xmm0
> >         ret
> >
> > where the results of a V4SI comparison are shuffled and bit-wise ANDed to
> > produce the desired result.  There's no change in the code generated for
> "-O2 -
> > msse4" where the compiler generates a single "pcmpeqq" insn.
> >
> > For V1TI mode, the results are equally dramatic, where the current -O2
> output
> > looks like:
> >
> >         movaps  %xmm0, -40(%rsp)
> >         movq    -40(%rsp), %rax
> >         movq    -32(%rsp), %rdx
> >         movaps  %xmm1, -24(%rsp)
> >         movq    -24(%rsp), %rcx
> >         movq    -16(%rsp), %rsi
> >         xorq    %rcx, %rax
> >         xorq    %rsi, %rdx
> >         orq     %rdx, %rax
> >         sete    %al
> >         xorl    %edx, %edx
> >         movzbl  %al, %eax
> >         negq    %rax
> >         adcq    $0, %rdx
> >         movq    %rax, %xmm2
> >         negq    %rdx
> >         movq    %rdx, -40(%rsp)
> >         movhps  -40(%rsp), %xmm2
> >         movdqa  %xmm2, %xmm0
> >         ret
> >
> > with this patch we now generate:
> >
> >         pcmpeqd %xmm0, %xmm1
> >         pshufd  $177, %xmm1, %xmm0
> >         pand    %xmm1, %xmm0
> >         pshufd  $78, %xmm0, %xmm1
> >         pand    %xmm1, %xmm0
> >         ret
> >
> > performing a V2DI comparison, followed by a shuffle and pand, and with
> > -O2 -msse4 take advantages of SSE4.1's pcmpeqq:
> >
> >         pcmpeqq %xmm0, %xmm1
> >         pshufd  $78, %xmm1, %xmm0
> >         pand    %xmm1, %xmm0
> >         ret
> >
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
> > make -k check, both with and without --target_board=unix{-m32}, with no
> new
> > failures.  Is this OK for when we return to stage 1?
> >
> >
> > 2022-04-20  Roger Sayle  <ro...@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >       * config/i386/sse.md (vec_cmpeqv2div2di): Enable for TARGET_SSE2.
> >       For !TARGET_SSE4_1, expand as a V4SI vector comparison, followed
> >       by a pshufd and pand.
> >       (vec_cmpeqv1tiv1ti): New define_expand implementing V1TImode
> >       vector equality as a V2DImode vector comparison (see above),
> >       followed by a pshufd and pand.
> >
> > gcc/testsuite/ChangeLog
> >       * gcc.target/i386/sse2-v1ti-veq.c: New test case.
> >       * gcc.target/i386/sse2-v1ti-vne.c: New test case.
> >


+  bool ok;
+  if (!TARGET_SSE4_1)
+    {
+      rtx ops[4];
+      ops[0] = gen_reg_rtx (V4SImode);
+      ops[2] = force_reg (V4SImode, gen_lowpart (V4SImode, operands[2]));
+      ops[3] = force_reg (V4SImode, gen_lowpart (V4SImode, operands[3]));

In general, this is better written as e.g.:

gen_lowpart (V4SImode, force_reg (V2DImode, operands[2]))

This ensures that we get a subreg of V2DImode register, and avoids
problems with gen_lowpart. Also, other expander functions should be
prepared to handle subregs, so in

+  rtx tmp2 = force_reg (V4SImode, gen_lowpart (V4SImode, dst));
+  emit_insn (gen_sse2_pshufd (tmp1, tmp2, GEN_INT (0x4e)));

forcing a subreg to a register before the call to gen_sse2_pshufd is
not needed, since dst is already a register.

Uros.

Reply via email to