RE: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.

Roger Sayle Tue, 11 Jul 2023 13:57:40 -0700


> From: Hongtao Liu <crazy...@gmail.com>
> Sent: 28 June 2023 04:23
> > From: Roger Sayle <ro...@nextmovesoftware.com>
> > Sent: 27 June 2023 20:28
> >
> > I've also come up with an alternate/complementary/supplementary
> > fix of generating the PTEST during RTL expansion, rather than rely on
> > this being caught/optimized later during STV.
> >
> > You may notice in this patch, the tests for TARGET_SSE4_1 and TImode
> > appear last.  When I was writing this, I initially also added support
> > for AVX VPTEST and OImode, before realizing that x86 doesn't (yet)
> > support 256-bit OImode (which also explains why we don't have an
> > OImode to V1OImode scalar-to-vector pass).  Retaining this clause
> > ordering should minimize the lines changed if things change in future.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> >
> >
> > 2023-06-27  Roger Sayle  <ro...@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386-expand.cc (ix86_expand_int_compare): If
> >         testing a TImode SUBREG of a 128-bit vector register against
> >         zero, use a PTEST instruction instead of first moving it to
> >         to scalar registers.
> >
> 
> +  /* Attempt to use PTEST, if available, when testing vector modes for
> +     equality/inequality against zero.  */  if (op1 == const0_rtx
> +      && SUBREG_P (op0)
> +      && cmpmode == CCZmode
> +      && SUBREG_BYTE (op0) == 0
> +      && REG_P (SUBREG_REG (op0))
> Just register_operand (op0, TImode),


I completely agree that in most circumstances, the early RTL optimizers
should use standard predicates, such as register_operand, that don't
distinguish between REG and SUBREG, allowing the choice (assignment)
to be left to register allocation (reload).

However in this case, unusually, the presence of the SUBREG, and treating
it differently from a REG is critical (in fact the reason for the patch).  
x86_64
can very efficiently test whether a 128-bit value is zero, setting ZF, either
in TImode, using orq %rax,%rdx in a single cycle/single instruction, or in
V1TImode, using ptest %xmm0,%xmm0, in a single cycle/single instruction.
There's no reason to prefer one form over the other.  A SUREG, however, that
moves the value from the scalar registers to a vector register, or from a vector
registers to scalar registers, requires two or three instructions, often reading
and writing values via memory, at a huge performance penalty.   Hence the
goal is to eliminate the (VIEW_CONVERT) SUBREG, and choose the appropriate
single-cycle test instruction for where the data is located.  Hence we want
to leave REG_P alone, but optimize (only) the SUBREG_P cases.
register_operand doesn't help with this.

Note this is counter to the usual advice.  Normally, a SUBREG between scalar
registers is cheap (in fact free) on x86, hence it safe for predicates to ignore
them prior to register allocation.  But another use of SUBREG, to represent
a VIEW_CONVERT_EXPR/transfer between processing units is closer to a
conversion, and a very expensive one (going via memory with different size
reads vs writes) at that.


> +      && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0)))
> +      && TARGET_SSE4_1
> +      && GET_MODE (op0) == TImode
> +      && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
> +    {
> +      tmp = SUBREG_REG (op0);
> and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));?
> I think RA can handle SUBREG correctly, no need for extra predicates.

Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))"
is forcing there to always be an inter-unit transfer/pipeline stall, when this 
is
idiom that we're trying to eliminate.

I should have repeated the motivating example from my original post at
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html

typedef long long __m128i __attribute__ ((__vector_size__ (16)));
int foo (__m128i x, __m128i y) {
  return (__int128)x == (__int128)y;
}

is currently generated as:
foo:    movaps  %xmm0, -40(%rsp)
        movq    -32(%rsp), %rdx
        movq    %xmm0, %rax
        movq    %xmm1, %rsi
        movaps  %xmm1, -24(%rsp)
        movq    -16(%rsp), %rcx
        xorq    %rsi, %rax
        xorq    %rcx, %rdx
        orq     %rdx, %rax
        sete    %al
        movzbl  %al, %eax
        ret

with this patch (to eliminate the interunit SUBREG) this becomes:

foo:    pxor    %xmm1, %xmm0
        xorl    %eax, %eax
        ptest   %xmm0, %xmm0
        sete    %al
        ret

Hopefully, this clarifies things a little.

RE: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.

Reply via email to