> From: Hongtao Liu <crazy...@gmail.com> > Sent: 28 June 2023 04:23 > > From: Roger Sayle <ro...@nextmovesoftware.com> > > Sent: 27 June 2023 20:28 > > > > I've also come up with an alternate/complementary/supplementary > > fix of generating the PTEST during RTL expansion, rather than rely on > > this being caught/optimized later during STV. > > > > You may notice in this patch, the tests for TARGET_SSE4_1 and TImode > > appear last. When I was writing this, I initially also added support > > for AVX VPTEST and OImode, before realizing that x86 doesn't (yet) > > support 256-bit OImode (which also explains why we don't have an > > OImode to V1OImode scalar-to-vector pass). Retaining this clause > > ordering should minimize the lines changed if things change in future. > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > > > 2023-06-27 Roger Sayle <ro...@nextmovesoftware.com> > > > > gcc/ChangeLog > > * config/i386/i386-expand.cc (ix86_expand_int_compare): If > > testing a TImode SUBREG of a 128-bit vector register against > > zero, use a PTEST instruction instead of first moving it to > > to scalar registers. > > > > + /* Attempt to use PTEST, if available, when testing vector modes for > + equality/inequality against zero. */ if (op1 == const0_rtx > + && SUBREG_P (op0) > + && cmpmode == CCZmode > + && SUBREG_BYTE (op0) == 0 > + && REG_P (SUBREG_REG (op0)) > Just register_operand (op0, TImode),
I completely agree that in most circumstances, the early RTL optimizers should use standard predicates, such as register_operand, that don't distinguish between REG and SUBREG, allowing the choice (assignment) to be left to register allocation (reload). However in this case, unusually, the presence of the SUBREG, and treating it differently from a REG is critical (in fact the reason for the patch). x86_64 can very efficiently test whether a 128-bit value is zero, setting ZF, either in TImode, using orq %rax,%rdx in a single cycle/single instruction, or in V1TImode, using ptest %xmm0,%xmm0, in a single cycle/single instruction. There's no reason to prefer one form over the other. A SUREG, however, that moves the value from the scalar registers to a vector register, or from a vector registers to scalar registers, requires two or three instructions, often reading and writing values via memory, at a huge performance penalty. Hence the goal is to eliminate the (VIEW_CONVERT) SUBREG, and choose the appropriate single-cycle test instruction for where the data is located. Hence we want to leave REG_P alone, but optimize (only) the SUBREG_P cases. register_operand doesn't help with this. Note this is counter to the usual advice. Normally, a SUBREG between scalar registers is cheap (in fact free) on x86, hence it safe for predicates to ignore them prior to register allocation. But another use of SUBREG, to represent a VIEW_CONVERT_EXPR/transfer between processing units is closer to a conversion, and a very expensive one (going via memory with different size reads vs writes) at that. > + && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0))) > + && TARGET_SSE4_1 > + && GET_MODE (op0) == TImode > + && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16) > + { > + tmp = SUBREG_REG (op0); > and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));? > I think RA can handle SUBREG correctly, no need for extra predicates. Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))" is forcing there to always be an inter-unit transfer/pipeline stall, when this is idiom that we're trying to eliminate. I should have repeated the motivating example from my original post at https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html typedef long long __m128i __attribute__ ((__vector_size__ (16))); int foo (__m128i x, __m128i y) { return (__int128)x == (__int128)y; } is currently generated as: foo: movaps %xmm0, -40(%rsp) movq -32(%rsp), %rdx movq %xmm0, %rax movq %xmm1, %rsi movaps %xmm1, -24(%rsp) movq -16(%rsp), %rcx xorq %rsi, %rax xorq %rcx, %rdx orq %rdx, %rax sete %al movzbl %al, %eax ret with this patch (to eliminate the interunit SUBREG) this becomes: foo: pxor %xmm1, %xmm0 xorl %eax, %eax ptest %xmm0, %xmm0 sete %al ret Hopefully, this clarifies things a little.