On Thu, Jun 6, 2024 at 6:07 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > Hi Hongtao, > Here's the third revision of my improved ternlog handling patch for x86. > This addresses the previously discovered problems, adding a check for > memory_operand, and adds four new test cases, to confirm that the > appropriate functionality is being triggered/covered, including a test > case for the example you reported requiring the memory_operand fix. > [Thanks to Alexander Monakov for suggesting I use my ternlog benchmark > as a coverage testcase.] > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32} > with no new failures. Ok for mainline? Ok.
BTW with -march=cascadelake, I notice there're new failures, most of them can be fixed by adjusting ix86_rtx_cost(to recognize ix86_ternlog_operand_p). gcc: gcc.target/i386/avx2-pr98461.c scan-assembler-times \tnotl\t 6 gcc: gcc.target/i386/avx512f-copysign.c scan-assembler-times vpternlog[dq][ \\t]+\\$(?:216|228|0xd8|0xe4), 5 gcc: gcc.target/i386/pr101989-broadcast-1.c scan-assembler-times \\{1to4\\} 4 gcc: gcc.target/i386/sse2-v1ti-vne.c scan-assembler-times pcmpeq 6 unix/-m32: gcc: gcc.target/i386/avx2-pr98461.c scan-assembler-times \tnotl\t 6 unix/-m32: gcc: gcc.target/i386/pr101989-broadcast-1.c scan-assembler-times \\{1to4\\} 4 New tests that FAIL (6 tests): gcc: gcc.target/i386/avx512f-vpternlogd-3.c scan-assembler-times vpternlogd[ \\t] 694 gcc: gcc.target/i386/avx512f-vpternlogd-4.c scan-assembler-times vpternlogd[ \\t] 694 unix/-m32: gcc: gcc.target/i386/avx512f-vpternlogd-3.c scan-assembler-times vpternlogd[ \\t] 694 unix/-m32: gcc: gcc.target/i386/avx512f-vpternlogd-4.c scan-assembler-times vpternlogd[ \\t] 694 > > > 2024-06-06 Roger Sayle <ro...@nextmovesoftware.com> > Hongtao Liu <hongtao....@intel.com> > > gcc/ChangeLog > * config/i386/i386-expand.cc (ix86_expand_args_builtin): Call > fixup_modeless_constant before testing predicates. Only call > copy_to_mode_reg on memory operands (after the first one). > (ix86_gen_bcst_mem): Helper function to convert a CONST_VECTOR > into a VEC_DUPLICATE if possible. > (ix86_ternlog_idx): Convert an RTX expression into a ternlog > index between 0 and 255, recording the operands in ARGS, if > possible or return -1 if this is not possible/valid. > (ix86_ternlog_leaf_p): Helper function to identify "leaves" > of a ternlog expression, e.g. REG_P, MEM_P, CONST_VECTOR, etc. > (ix86_ternlog_operand_p): Test whether a expression is suitable > for and prefered as an UNSPEC_TERNLOG. > (ix86_expand_ternlog_binop): Helper function to construct the > binary operation corresponding to a sufficiently simple ternlog. > (ix86_expand_ternlog_andnot): Helper function to construct a > ANDN operation corresponding to a sufficiently simple ternlog. > (ix86_expand_ternlog): Expand a 3-operand ternary logic > expression, constructing either an UNSPEC_TERNLOG or simpler > rtx expression. Called from builtin expanders and pre-reload > splitters. > * config/i386/i386-protos.h (ix86_ternlog_idx): Prototype here. > (ix86_ternlog_operand_p): Likewise. > (ix86_expand_ternlog): Likewise. > * config/i386/predicates.md (ternlog_operand): New predicate > that calls xi86_ternlog_operand_p. > * config/i386/sse.md (<avx512>_vpternlog<mode>_0): New > define_insn_and_split that recognizes a SET_SRC of ternlog_operand > and expands it via ix86_expand_ternlog pre-reload. > (<avx512>_vternlog<mode>_mask): Convert from define_insn to > define_expand. Use ix86_expand_ternlog if the mask operand is > ~0 (or 255 or -1). > (*<avx512>_vternlog<mode>_mask): define_insn renamed from above. > > gcc/testsuite/ChangeLog > * gcc.target/i386/avx512f-vpternlogd-1.c: Update test case. > * gcc.target/i386/avx512f-vpternlogq-1.c: Likewise. > * gcc.target/i386/avx512vl-vpternlogd-1.c: Likewise. > * gcc.target/i386/avx512vl-vpternlogq-1.c: Likewise. > * gcc.target/i386/pr100711-4.c: Likewise. > * gcc.target/i386/pr100711-5.c: Likewise. > > * gcc.target/i386/avx512f-vpternlogd-3.c: New 128-bit test case. > * gcc.target/i386/avx512f-vpternlogd-4.c: New 256-bit test case. > * gcc.target/i386/avx512f-vpternlogd-5.c: New 512-bit test case. > * gcc.target/i386/avx512f-vpternlogq-3.c: New test case. > > > Thanks in advance, > Roger > > > -----Original Message----- > > From: Hongtao Liu <crazy...@gmail.com> > > On Mon, May 27, 2024 at 2:48 PM Hongtao Liu <crazy...@gmail.com> wrote: > > > > > > On Sat, May 18, 2024 at 4:10 AM Roger Sayle <ro...@nextmovesoftware.com> > > wrote: > > > > > > > > > > > > Hi Hongtao, > > > > Many thanks for the review, bug fixes and suggestions for improvements. > > > > This revised version of the patch, implements all of your > > > > corrections. In theory the "ternlog idx" should guarantee that some > > > > operands are non-null, but I agree that it's better defensive > > > > programming to > > check invariants not easily proved. > > > > Instead of calling ix86_expand_vector_move, I use > > > > ix86_broadcast_from_constant to achieve the same effect of using a > > > > broadcast when possible, but has the benefit of still using a memory > > > > operand (instead of a vector load) when broadcasting isn't possible. > > > > There are other places that could benefit from the same trick, but I > > > > can address these in a follow-up patch (it may even be preferrable > > > > to keep these as CONST_VECTOR during early RTL passes and lower to > > broadcast or constant pool using splitters). > > > > > > > > This revised patch has been tested on x86_64-pc-linux-gnu with make > > > > bootstrap and make -k check, both with and without > > > > --target_board=unix{-m32} with no new failures. Ok for mainline? > > > 1 file changed, 41 insertions(+) > > > gcc/config/i386/i386-expand.cc | 41 > > > +++++++++++++++++++++++++++++++++++++++++ > > > > > > modified gcc/config/i386/i386-expand.cc > > > @@ -25579,14 +25579,22 @@ ix86_gen_bcst_mem (machine_mode mode, rtx > > x) > > > && !CONST_DOUBLE_P (cst) > > > && !CONST_FIXED_P (cst)) > > > return NULL_RTX; > > > + /* I think VALID_BCST_MODE_P should be sufficient to > > > + make sure cst is CONST_INT or CONST_DOUBLE. */ > > > > > > int n_elts = GET_MODE_NUNITS (mode); > > > if (CONST_VECTOR_NUNITS (x) != n_elts) > > > return NULL_RTX; > > > + /* Do we need this? I saw from caller side there's already > > > + if (GET_MODE (op2) != mode) > > > + op2 = gen_lowpart (mode, op2); > > > + tmp2 = ix86_gen_bcst_mem (mode, op2); */ > > > + > > > > > > for (int i = 1; i < n_elts; i++) > > > if (!rtx_equal_p (cst, CONST_VECTOR_ELT (x, i))) > > > return NULL_RTX; > > > + /* CONST_VECTOR_DUPLICATE_P (op)? */ > > > > > > rtx mem = force_const_mem (GET_MODE_INNER (mode), cst); > > > return gen_rtx_VEC_DUPLICATE (mode, validize_mem (mem)); @@ > > > -25709,6 +25717,21 @@ ix86_ternlog_idx (rtx op, rtx *args) > > > || ix86_ternlog_idx (XVECEXP (op, 0, 2), args) != 0xaa) > > > return -1; > > > return INTVAL (XVECEXP (op, 0, 3)); > > > + /* I think we can add some testcase for this. > > > + .i.e > > > + #include <immintrin.h> > > > + > > > + __m256i > > > + foo (__m256i a, __m256i b, __m256i c) { return (a & > > > + _mm256_ternarylogic_epi64 (a, b, c, 0xe4)); } > > > + > > > + __m256i > > > + foo1 (__m256i a, __m256i b, __m256i c) { return (b & > > > + _mm256_ternarylogic_epi64 (a, b, c, 0xe4)); } */ > > > > > > default: > > > return -1; > > > @@ -25778,6 +25801,8 @@ ix86_ternlog_operand_p (rtx op) > > > if (ix86_ternlog_leaf_p (XEXP (op, 0), mode) > > > && (ix86_ternlog_leaf_p (op1, mode) > > > || vector_all_ones_operand (op1, mode))) > > > + /* There's CONST_VECTOR check in x86_ternlog_leaf_p, > > > + so vector_all_ones_operand is not needed. */ > > > return false; > > > break; > > > > > > @@ -25862,6 +25887,10 @@ ix86_expand_ternlog (machine_mode mode, rtx > > > op0, rtx op1, rtx op2, int idx, > > > if ((!op0 || !side_effects_p (op0)) > > > && (!op1 || !side_effects_p (op1)) > > > && (!op2 || !side_effects_p (op2))) > > > + /* I think only op2 needs to check side_effects_p, op0 > > > + and op1 must be register operand when it exists, no need for > > > side_effects_p? > > > + Similar for all below side_effects_p (op0/op1) > > > + the check is redundant. */ > > > { > > > emit_move_insn (target, CONST0_RTX (mode)); > > > return target; > > > @@ -25872,6 +25901,9 @@ ix86_expand_ternlog (machine_mode mode, rtx > > > op0, rtx op1, rtx op2, int idx, > > > if ((!op1 || !side_effects_p (op1)) > > > && op0 && register_operand (op0, mode) > > > && op2 && register_operand (op2, mode)) > > > + /* op0/op1 must be register_operand when it exists, > > > + so register_operand (op0/op1, mode) is not needed. > > > + similar for all below register_operand (op0/op1, mode). */ > > > return ix86_expand_ternlog_andnot (mode, op0, op2, target); > > > break; > > > > > > @@ -25879,6 +25911,7 @@ ix86_expand_ternlog (machine_mode mode, rtx > > > op0, rtx op1, rtx op2, int idx, > > > if ((!op2 || !side_effects_p (op2)) > > > && op0 && register_operand (op0, mode) > > > && op1 && register_operand (op1, mode)) > > > + /* op0 && op1? */ > > > return ix86_expand_ternlog_andnot (mode, op0, op1, target); > > > break; > > > > > > @@ -25948,6 +25981,7 @@ ix86_expand_ternlog (machine_mode mode, rtx > > > op0, rtx op1, rtx op2, int idx, > > > if ((!op0 || !side_effects_p (op0)) > > > && (!op1 || !side_effects_p (op1)) > > > && op2) > > > + /* if (op2). */ > > > { > > > if (GET_MODE (op2) != mode) > > > op2 = gen_lowpart (mode, op2); > > > @@ -25961,18 +25995,21 @@ ix86_expand_ternlog (machine_mode mode, rtx > > > op0, rtx op1, rtx op2, int idx, > > > case 0x5a: /* a^c */ > > > if (op0 && op2 > > > && (!op1 || !side_effects_p (op1))) > > > + /* if (op0 && op2). */ > > > return ix86_expand_ternlog_binop (XOR, mode, op0, op2, target); > > > break; > > > > > > case 0x66: /* b^c */ > > > if ((!op0 || !side_effects_p (op0)) > > > && op1 && op2) > > > + /* if (op1 && op2). */ > > > return ix86_expand_ternlog_binop (XOR, mode, op1, op2, target); > > > break; > > > > > > case 0x88: /* b&c */ > > > if ((!op0 || !side_effects_p (op0)) > > > && op1 && op2) > > > + /* if (op1 && op2). */ > > > return ix86_expand_ternlog_binop (AND, mode, op1, op2, target); > > > break; > > > > > > @@ -26054,6 +26091,9 @@ ix86_expand_ternlog (machine_mode mode, rtx > > > op0, rtx op1, rtx op2, int idx, > > > } > > > > > > tmp0 = register_operand (op0, mode) ? op0 : force_reg (mode, op0); > > > + /* Do you observe there're cases of op0 not register_operand?. > > > + if it's from <avx512>_vternlog<mode>_mask, it must be > > > register_operand. > > > + if it's from ix86_ternlog_idx, it must REG_P. */ > > > if (GET_MODE (tmp0) != mode) > > > tmp0 = gen_lowpart (mode, tmp0); > > > > > > @@ -26061,6 +26101,7 @@ ix86_expand_ternlog (machine_mode mode, rtx > > > op0, rtx op1, rtx op2, int idx, > > > tmp1 = copy_rtx (tmp0); > > > else if (!register_operand (op1, mode)) > > > tmp1 = force_reg (mode, op1); > > > + /* Ditto. */ > > > else > > > tmp1 = op1; > > > if (GET_MODE (tmp1) != mode) > > > > > > > > > > > > > > > -- > > > BR, > > > Hongtao > > > > Got ICE for below testcase > > > > #include <immintrin.h> > > __m256i > > foo2 (__m256i** a, __m256i b) > > { > > return ~(**a); > > } > > > > with -march=x86-64-v4 -O2 > > > > (insn 17 7 13 2 (set (reg:V4DI 103 [ _5 ]) > > (xor:V4DI (mem:V4DI (mem/f:DI (reg:DI 105) [1 *a_4(D)+0 S8 > > A64]) [0 *_1+0 S32 A256]) > > (const_vector:V4DI [ > > (const_int -1 [0xffffffffffffffff]) repeated x4 > > ]))) "test.c":7:10 -1 > > (expr_list:REG_DEAD (reg:DI 105) > > (nil))) > > during RTL pass: ira > > > > I think we need to check memory_operand in ix86_ternlog_idx > > > > case MEM: > > if (MEM_P (op) > > && MEM_VOLATILE_P (op) > > && !volatile_ok) > > return -1; > > /* FALLTHRU */ > > > > -- > > BR, > > Hongtao -- BR, Hongtao