On Wed, Jul 12, 2023 at 4:57 AM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > > From: Hongtao Liu <crazy...@gmail.com> > > Sent: 28 June 2023 04:23 > > > From: Roger Sayle <ro...@nextmovesoftware.com> > > > Sent: 27 June 2023 20:28 > > > > > > I've also come up with an alternate/complementary/supplementary > > > fix of generating the PTEST during RTL expansion, rather than rely on > > > this being caught/optimized later during STV. > > > > > > You may notice in this patch, the tests for TARGET_SSE4_1 and TImode > > > appear last. When I was writing this, I initially also added support > > > for AVX VPTEST and OImode, before realizing that x86 doesn't (yet) > > > support 256-bit OImode (which also explains why we don't have an > > > OImode to V1OImode scalar-to-vector pass). Retaining this clause > > > ordering should minimize the lines changed if things change in future. > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > > and make -k check, both with and without --target_board=unix{-m32} > > > with no new failures. Ok for mainline? > > > > > > > > > 2023-06-27 Roger Sayle <ro...@nextmovesoftware.com> > > > > > > gcc/ChangeLog > > > * config/i386/i386-expand.cc (ix86_expand_int_compare): If > > > testing a TImode SUBREG of a 128-bit vector register against > > > zero, use a PTEST instruction instead of first moving it to > > > to scalar registers. > > > > > > > + /* Attempt to use PTEST, if available, when testing vector modes for > > + equality/inequality against zero. */ if (op1 == const0_rtx > > + && SUBREG_P (op0) > > + && cmpmode == CCZmode > > + && SUBREG_BYTE (op0) == 0 > > + && REG_P (SUBREG_REG (op0)) > > Just register_operand (op0, TImode), > > I completely agree that in most circumstances, the early RTL optimizers > should use standard predicates, such as register_operand, that don't > distinguish between REG and SUBREG, allowing the choice (assignment) > to be left to register allocation (reload). > > However in this case, unusually, the presence of the SUBREG, and treating > it differently from a REG is critical (in fact the reason for the patch). > x86_64 > can very efficiently test whether a 128-bit value is zero, setting ZF, either > in TImode, using orq %rax,%rdx in a single cycle/single instruction, or in > V1TImode, using ptest %xmm0,%xmm0, in a single cycle/single instruction. > There's no reason to prefer one form over the other. A SUREG, however, that > moves the value from the scalar registers to a vector register, or from a > vector > registers to scalar registers, requires two or three instructions, often > reading > and writing values via memory, at a huge performance penalty. Hence the > goal is to eliminate the (VIEW_CONVERT) SUBREG, and choose the appropriate > single-cycle test instruction for where the data is located. Hence we want > to leave REG_P alone, but optimize (only) the SUBREG_P cases. > register_operand doesn't help with this. > > Note this is counter to the usual advice. Normally, a SUBREG between scalar > registers is cheap (in fact free) on x86, hence it safe for predicates to > ignore > them prior to register allocation. But another use of SUBREG, to represent > a VIEW_CONVERT_EXPR/transfer between processing units is closer to a > conversion, and a very expensive one (going via memory with different size > reads vs writes) at that. > > > > + && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0))) > > + && TARGET_SSE4_1 > > + && GET_MODE (op0) == TImode > > + && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16) > > + { > > + tmp = SUBREG_REG (op0); > > and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));? > > I think RA can handle SUBREG correctly, no need for extra predicates. > > Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))" > is forcing there to always be an inter-unit transfer/pipeline stall, when > this is > idiom that we're trying to eliminate. > > I should have repeated the motivating example from my original post at > https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html > > typedef long long __m128i __attribute__ ((__vector_size__ (16))); > int foo (__m128i x, __m128i y) { > return (__int128)x == (__int128)y; > } > > is currently generated as: > foo: movaps %xmm0, -40(%rsp) > movq -32(%rsp), %rdx > movq %xmm0, %rax > movq %xmm1, %rsi > movaps %xmm1, -24(%rsp) > movq -16(%rsp), %rcx > xorq %rsi, %rax > xorq %rcx, %rdx > orq %rdx, %rax > sete %al > movzbl %al, %eax > ret > > with this patch (to eliminate the interunit SUBREG) this becomes: > > foo: pxor %xmm1, %xmm0 > xorl %eax, %eax > ptest %xmm0, %xmm0 > sete %al > ret > > Hopefully, this clarifies things a little. Thanks for the explanation, the patch LGTM. One curious question, is there any case SUBREG_BYTE != 0 when inner and outer mode(TImode) have the same mode size(16 bytes)?
> > -- BR, Hongtao