RE: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.

Roger Sayle Wed, 12 Jul 2023 00:29:22 -0700


> From: Hongtao Liu <crazy...@gmail.com>
> Sent: 12 July 2023 01:45
> 
> On Wed, Jul 12, 2023 at 4:57 AM Roger Sayle <ro...@nextmovesoftware.com>
> > > From: Hongtao Liu <crazy...@gmail.com>
> > > Sent: 28 June 2023 04:23
> > > > From: Roger Sayle <ro...@nextmovesoftware.com>
> > > > Sent: 27 June 2023 20:28
> > > >
> > > > I've also come up with an alternate/complementary/supplementary
> > > > fix of generating the PTEST during RTL expansion, rather than rely
> > > > on this being caught/optimized later during STV.
> > > >
> > > > You may notice in this patch, the tests for TARGET_SSE4_1 and
> > > > TImode appear last.  When I was writing this, I initially also
> > > > added support for AVX VPTEST and OImode, before realizing that x86
> > > > doesn't (yet) support 256-bit OImode (which also explains why we
> > > > don't have an OImode to V1OImode scalar-to-vector pass).
> > > > Retaining this clause ordering should minimize the lines changed if 
> > > > things
> change in future.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > bootstrap and make -k check, both with and without
> > > > --target_board=unix{-m32} with no new failures.  Ok for mainline?
> > > >
> > > >
> > > > 2023-06-27  Roger Sayle  <ro...@nextmovesoftware.com>
> > > >
> > > > gcc/ChangeLog
> > > >         * config/i386/i386-expand.cc (ix86_expand_int_compare): If
> > > >         testing a TImode SUBREG of a 128-bit vector register against
> > > >         zero, use a PTEST instruction instead of first moving it to
> > > >         to scalar registers.
> > > >
> > >
> > > +  /* Attempt to use PTEST, if available, when testing vector modes for
> > > +     equality/inequality against zero.  */  if (op1 == const0_rtx
> > > +      && SUBREG_P (op0)
> > > +      && cmpmode == CCZmode
> > > +      && SUBREG_BYTE (op0) == 0
> > > +      && REG_P (SUBREG_REG (op0))
> > > Just register_operand (op0, TImode),
> >
> > I completely agree that in most circumstances, the early RTL
> > optimizers should use standard predicates, such as register_operand,
> > that don't distinguish between REG and SUBREG, allowing the choice
> > (assignment) to be left to register allocation (reload).
> >
> > However in this case, unusually, the presence of the SUBREG, and
> > treating it differently from a REG is critical (in fact the reason for
> > the patch).  x86_64 can very efficiently test whether a 128-bit value
> > is zero, setting ZF, either in TImode, using orq %rax,%rdx in a single
> > cycle/single instruction, or in V1TImode, using ptest %xmm0,%xmm0, in a 
> > single
> cycle/single instruction.
> > There's no reason to prefer one form over the other.  A SUREG,
> > however, that moves the value from the scalar registers to a vector
> > register, or from a vector registers to scalar registers, requires two or 
> > three
> instructions, often reading
> > and writing values via memory, at a huge performance penalty.   Hence the
> > goal is to eliminate the (VIEW_CONVERT) SUBREG, and choose the
> > appropriate single-cycle test instruction for where the data is
> > located.  Hence we want to leave REG_P alone, but optimize (only) the
> SUBREG_P cases.
> > register_operand doesn't help with this.
> >
> > Note this is counter to the usual advice.  Normally, a SUBREG between
> > scalar registers is cheap (in fact free) on x86, hence it safe for
> > predicates to ignore them prior to register allocation.  But another
> > use of SUBREG, to represent a VIEW_CONVERT_EXPR/transfer between
> > processing units is closer to a conversion, and a very expensive one
> > (going via memory with different size reads vs writes) at that.
> >
> >
> > > +      && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0)))
> > > +      && TARGET_SSE4_1
> > > +      && GET_MODE (op0) == TImode
> > > +      && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
> > > +    {
> > > +      tmp = SUBREG_REG (op0);
> > > and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));?
> > > I think RA can handle SUBREG correctly, no need for extra predicates.
> >
> > Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))"
> > is forcing there to always be an inter-unit transfer/pipeline stall,
> > when this is idiom that we're trying to eliminate.
> >
> > I should have repeated the motivating example from my original post at
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html
> >
> > typedef long long __m128i __attribute__ ((__vector_size__ (16))); int
> > foo (__m128i x, __m128i y) {
> >   return (__int128)x == (__int128)y;
> > }
> >
> > is currently generated as:
> > foo:    movaps  %xmm0, -40(%rsp)
> >         movq    -32(%rsp), %rdx
> >         movq    %xmm0, %rax
> >         movq    %xmm1, %rsi
> >         movaps  %xmm1, -24(%rsp)
> >         movq    -16(%rsp), %rcx
> >         xorq    %rsi, %rax
> >         xorq    %rcx, %rdx
> >         orq     %rdx, %rax
> >         sete    %al
> >         movzbl  %al, %eax
> >         ret
> >
> > with this patch (to eliminate the interunit SUBREG) this becomes:
> >
> > foo:    pxor    %xmm1, %xmm0
> >         xorl    %eax, %eax
> >         ptest   %xmm0, %xmm0
> >         sete    %al
> >         ret
> >
> > Hopefully, this clarifies things a little.
> Thanks for the explanation, the patch LGTM.


Thanks.

> One curious question, is there any case SUBREG_BYTE != 0 when inner
> and outer mode(TImode) have the same mode size(16 bytes)?

Great question. Some of this is tucked away in the e-mails quoted above.
In addition to setting the ZF flag by comparing a 128-bit vector with zero, 
x86's
ptest/vptest pattern can also be used for comparing 256-bit vectors with zero
(hypothetically) using OImode, and 512-bit vectors with zero, using XImode.
Hence mention of out mode (OImode) and mode size (32 bytes).

The hold-up is that there doesn't appear to be a portable way in C/C++ to
express "if (is_all_zero_p (vector))" without using backend specific builtins,
as vector comparisons return vector results [though if someone knows the
secret incantation, please share].  Casting to __int128 (TImode) works on
x86_64 [as in this patch], because TImode is supported, but unfortunately
OImode and XImode aren't (yet).

This is related to the interesting bug, that even with SSE, AVX or AVX512
support, it's not possible to define V1TImode with -m32, as the usual typedef:
typedef __int128 v1ti __attribute__ ((vector_size (16))
fails with the error __int128 is not supported on  this platform.  I've been
investigating whether the front-ends can peek at the vector_size attribute
to work-around the target's scalar constraints.  This might even allow for
V1OImode and V1XImode on x86_64.

But you're right, if I'm always testing that the scalar mode and the vector
mode are the same size, the SUBREG offset should (I believe) always be
zero (which on x86 corresponds to lowpart), if the SUBREG isn't paradoxical.
The apparent redundancy is for readability, defensive programming and
avoiding more expensive tests/clauses when the SUBREG_BYTE isn't zero.


Thanks again,
Roger
--

RE: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.

Reply via email to