On Tue, Aug 20, 2024 at 3:08 PM Tamar Christina <tamar.christ...@arm.com> wrote: > > Hi All, > > IVOPTS normally uses affine trees to perform comparisons between different > IVs, > but these seem to have been missing in two key spots and instead normal tree > equivalencies used. > > In some cases where we have a two-complements equivalence but not a strict > signedness equivalencies we end up generating both a signed and unsigned IV > for > the same candidate. > > This patch implements a new OEP flag called OEP_STRUCTURAL_EQ. This flag will > check if the operands would produce the same bit values after the computations > even if the final sign is different.
I think the name is badly chosen - we already have OEP_LEXICOGRAPHIC and OEP_BITWISE. I would suggest to use OEP_ASSUME_TWOS_COMPLEMENT or OEP_ASSUME_WRAPV. > This happens quite a lot with fortran but can also happen in C because this > came > code is unable to figure out when one expression is a multiple of another. > > As an example in the attached testcase we get: > > Initial set of candidates: > cost: 24 (complexity 3) > reg_cost: 9 > cand_cost: 15 > cand_group_cost: 0 (complexity 3) > candidates: 1, 6, 8 > group:0 --> iv_cand:6, cost=(0,1) > group:1 --> iv_cand:1, cost=(0,0) > group:2 --> iv_cand:8, cost=(0,1) > group:3 --> iv_cand:8, cost=(0,1) > invariant variables: 6 > invariant expressions: 1, 2 > > <Invariant Expressions>: > inv_expr 1: stride.3_27 * 4 > inv_expr 2: (unsigned long) stride.3_27 * 4 > > These end up being used in the same group: > > Group 1: > cand cost compl. inv.expr. inv.vars > 1 0 0 NIL; 6 > 2 0 0 NIL; 6 > 3 0 0 NIL; 6 > > which ends up with IV opts picking the signed and unsigned IVs: > > Improved to: > cost: 24 (complexity 3) > reg_cost: 9 > cand_cost: 15 > cand_group_cost: 0 (complexity 3) > candidates: 1, 6, 8 > group:0 --> iv_cand:6, cost=(0,1) > group:1 --> iv_cand:1, cost=(0,0) > group:2 --> iv_cand:8, cost=(0,1) > group:3 --> iv_cand:8, cost=(0,1) > invariant variables: 6 > invariant expressions: 1, 2 > > and so generates the same IV as both signed and unsigned: > > ;; basic block 21, loop depth 3, count 214748368 (estimated locally, freq > 58.2545), maybe hot > ;; prev block 28, next block 31, flags: (NEW, REACHABLE, VISITED) > ;; pred: 28 [always] count:23622320 (estimated locally, freq > 6.4080) (FALLTHRU,EXECUTABLE) > ;; 25 [always] count:191126046 (estimated locally, freq > 51.8465) (FALLTHRU,DFS_BACK,EXECUTABLE) > # .MEM_66 = PHI <.MEM_34(28), .MEM_22(25)> > # ivtmp.22_41 = PHI <0(28), ivtmp.22_82(25)> > # ivtmp.26_51 = PHI <ivtmp.26_55(28), ivtmp.26_72(25)> > # ivtmp.28_90 = PHI <ivtmp.28_99(28), ivtmp.28_98(25)> > > ... > > ;; basic block 24, loop depth 3, count 214748366 (estimated locally, freq > 58.2545), maybe hot > ;; prev block 22, next block 25, flags: (NEW, REACHABLE, VISITED)' > ;; pred: 22 [always] count:95443719 (estimated locally, freq > 25.8909) (FALLTHRU) > ;; 21 [33.3% (guessed)] count:71582790 (estimated locally, > freq 19.4182) (TRUE_VALUE,EXECUTABLE) > ;; 31 [33.3% (guessed)] count:47721860 (estimated locally, > freq 12.9455) (TRUE_VALUE,EXECUTABLE) > # .MEM_22 = PHI <.MEM_44(22), .MEM_31(21), .MEM_79(31)> > ivtmp.22_82 = ivtmp.22_41 + 1; > ivtmp.26_72 = ivtmp.26_51 + _80; > ivtmp.28_98 = ivtmp.28_90 + _39; > > These two IVs are always used as unsigned, so IV ops generates: > > _73 = stride.3_27 * 4; > _80 = (unsigned long) _73; > _54 = (unsigned long) stride.3_27; > _39 = _54 * 4; > > Which means that in e.g. exchange2 we generate a lot of duplicate code. > > This is because candidate 6 and 8 are equivalent under two's complement but > have > different signs. > > This patch changes it so that if you have two IVs that are affine equivalent > to > just pick one over the other. IV already has code for this, so the patch just > uses affine trees instead of tree for the check. > > With it we get: > > <Invariant Expressions>: > inv_expr 1: stride.3_27 * 4 > > <Group-candidate Costs>: > Group 0: > cand cost compl. inv.expr. inv.vars > 5 0 2 NIL; NIL; > 6 0 3 NIL; NIL; > > Group 1: > cand cost compl. inv.expr. inv.vars > 1 0 0 NIL; 6 > 2 0 0 NIL; 6 > 3 0 0 NIL; 6 > 4 0 0 NIL; 6 > > Initial set of candidates: > cost: 16 (complexity 3) > reg_cost: 6 > cand_cost: 10 > cand_group_cost: 0 (complexity 3) > candidates: 1, 6 > group:0 --> iv_cand:6, cost=(0,3) > group:1 --> iv_cand:1, cost=(0,0) > invariant variables: 6 > invariant expressions: 1 > > The two patches together results in a 10% performance increase in exchange2 in > SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in compile > time. There's also a 5% performance improvement in fotonik3d and similar > reduction in binary size. > > Bootstrapped Regtested on aarch64-none-linux-gnu, > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu > -m32, -m64 and no issues. > > Ok for master? > > Thanks, > Tamar > > gcc/ChangeLog: > > PR tree-optimization/114932 > * fold-const.cc (operand_compare::operand_equal_p): Use it. > (operand_compare::verify_hash_value): Likewise. > (test_operand_equality::test): New. > (fold_const_cc_tests): Use it. > * tree-core.h (enum operand_equal_flag): Add OEP_STRUCTURAL_EQ. > * tree-ssa-loop-ivopts.cc (record_group_use): Check for structural eq. > > gcc/testsuite/ChangeLog: > > PR tree-optimization/114932 > * gfortran.dg/addressing-modes_2.f90: New test. > > --- > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc > index > 71e82b1d76d4106c7c23c54af8b35905a1af9f1c..52a4465de0355220befa6d6b3301e21e038e3e41 > 100644 > --- a/gcc/fold-const.cc > +++ b/gcc/fold-const.cc > @@ -3208,7 +3208,19 @@ operand_compare::operand_equal_p (tree type0, > const_tree arg0, > return tree_int_cst_equal (arg0, arg1); > } > > - if (!(flags & OEP_ADDRESS_OF)) > + if ((flags & OEP_STRUCTURAL_EQ) > + && (CONVERT_EXPR_P (arg0) || CONVERT_EXPR_P (arg1))) > + { > + const_tree t_arg0 = arg0; > + const_tree t_arg1 = arg1; > + STRIP_NOPS (arg0); > + STRIP_NOPS (arg1); > + /* Only recurse if the conversion was one that was valid to strip. */ > + if (t_arg0 != arg0 || t_arg1 != arg1) > + return operand_equal_p (type0, arg0, type1, arg1, flags); (*) > + } > + > + if (!(flags & (OEP_ADDRESS_OF | OEP_STRUCTURAL_EQ))) > { > /* If both types don't have the same signedness, then we can't consider > them equal. We must check this before the STRIP_NOPS calls > @@ -3439,6 +3451,22 @@ operand_compare::operand_equal_p (tree type0, > const_tree arg0, > if (flags & (OEP_ONLY_CONST | OEP_BITWISE)) > return false; > > + /* Check if we are checking an operation where the two's compliment bitwise > + representation of the result is not the same between signed and > + unsigned arithmetic. */ > + switch (TREE_CODE (arg0)) > + { > + case PLUS_EXPR: > + case MINUS_EXPR: > + case MULT_EXPR: > + break; > + > + default: > + /* Clear the structural comparison bit. */ > + flags &= ~OEP_STRUCTURAL_EQ; So this handling means that a / (int)((unsigned)b - (unsigned)c)) and a / (b - c) are not equal with respect to OEP_STRUCTURAL_EQ because we drop that flag when processing the division and thus the recursion on the minus operands doesn't have it any more. So shouldn't this and (*) be integrated with the TYPE_UNSIGNED compare in the !OEP_ADDRESS_OF case? Aka not require same signedness for PLUS/MINUS/MULT/CONVERT? I think we have to document what the TYPE, ARG combination means semantically. As far as I understand TYPE doesn't specify the signedness of the operation ARG but instead represents a possibly stripped conversion of the result? -- There's /* Don't handle more cases for OEP_BITWISE, since we can't guarantee that two instances of undefined behavior will give identical results. */ if (flags & (OEP_ONLY_CONST | OEP_BITWISE)) return false; which is true - so I think we need to document what using OEP_STRUCTURAL_EQ means, in particular that it does not mean you can replace ARG0 with ARG1 without externally guaranteeing you are not introducing new UB. I do wonder about say comparing (int)((unsigned)a + (unsigned)b) + (c + d) with (a + b) + (int)((unsigned)c + (unsigned)d) for example - we can replace both with a all-unsigned computation or replace one with the other when both would have been always executed. I think this needs to be clearly documented. > + break; > + } > + > /* Define macros to test an operand from arg0 and arg1 for equality and a > variant that allows null and views null as being different from any > non-null value. In the latter case, if either is null, the both > @@ -4218,7 +4246,7 @@ operand_compare::verify_hash_value (const_tree arg0, > const_tree arg1, > { > if (operand_equal_p (arg0, arg1, flags | OEP_NO_HASH_CHECK)) > { > - if (arg0 != arg1 && !(flags & OEP_DECL_NAME)) > + if (arg0 != arg1 && !(flags & (OEP_DECL_NAME | OEP_STRUCTURAL_EQ))) > { > inchash::hash hstate0 (0), hstate1 (0); > hash_operand (arg0, hstate0, flags | OEP_HASH_CHECK); > @@ -17357,6 +17385,95 @@ test_arithmetic_folding () > x); > } > > +namespace test_operand_equality { > + > +/* Verify structural equality. */ > + > +/* Execute fold_vec_perm_cst unit tests. */ > + > +static void > +test () > +{ > + tree stype = integer_type_node; > + tree utype = unsigned_type_node; > + tree x = create_tmp_var_raw (stype, "x"); > + tree y = create_tmp_var_raw (stype, "y"); > + tree z = create_tmp_var_raw (stype, "z"); > + tree four = build_int_cst (stype, 4); > + tree lhs1 = fold_build2 (PLUS_EXPR, stype, x, y); > + tree rhs1 = fold_convert (stype, > + fold_build2 (PLUS_EXPR, utype, > + fold_convert (utype, x), > + fold_convert (utype, y))); > + > + /* (int)((unsigned x) + (unsigned y)) == x + y. */ > + ASSERT_TRUE (operand_equal_p (lhs1, rhs1, OEP_STRUCTURAL_EQ)); > + ASSERT_FALSE (operand_equal_p (lhs1, rhs1, 0)); > + > + /* (int)(unsigned) x == x. */ > + tree lhs2 = build1 (NOP_EXPR, stype, > + build1 (NOP_EXPR, utype, x)); > + tree rhs2 = x; > + ASSERT_TRUE (operand_equal_p (lhs2, rhs2, OEP_STRUCTURAL_EQ)); > + ASSERT_TRUE (operand_equal_p (lhs2, rhs2, 0)); That's not too useful - we'd never see this due to folding. > + > + /* (unsigned x) + (unsigned y) == x + y. */ > + tree lhs3 = lhs1; > + tree rhs3 = fold_build2 (PLUS_EXPR, utype, > + fold_convert (utype, x), > + fold_convert (utype, y)); > + ASSERT_TRUE (operand_equal_p (lhs3, rhs3, OEP_STRUCTURAL_EQ)); > + ASSERT_FALSE (operand_equal_p (lhs3, rhs3, 0)); > + > + /* (unsigned x) / (unsigned y) == x / y. */ > + tree lhs4 = fold_build2 (TRUNC_DIV_EXPR, stype, x, y);; > + tree rhs4 = fold_build2 (TRUNC_DIV_EXPR, utype, > + fold_convert (utype, x), > + fold_convert (utype, y)); > + ASSERT_FALSE (operand_equal_p (lhs4, rhs4, OEP_STRUCTURAL_EQ)); > + ASSERT_FALSE (operand_equal_p (lhs4, rhs4, 0)); > + > + /* (unsigned x) / 4 == x / 4. */ > + tree lhs5 = fold_build2 (TRUNC_DIV_EXPR, stype, x, four);; > + tree rhs5 = fold_build2 (TRUNC_DIV_EXPR, utype, > + fold_convert (utype, x), > + fold_convert (utype, four)); > + ASSERT_FALSE (operand_equal_p (lhs5, rhs5, OEP_STRUCTURAL_EQ)); > + ASSERT_FALSE (operand_equal_p (lhs5, rhs5, 0)); > + > + /* (unsigned x) + 4 == x + 4. */ > + tree lhs6 = fold_build2 (PLUS_EXPR, stype, x, four); > + tree rhs6 = fold_build2 (PLUS_EXPR, utype, > + fold_convert (utype, x), > + fold_convert (utype, four)); > + ASSERT_TRUE (operand_equal_p (lhs6, rhs6, OEP_STRUCTURAL_EQ)); > + ASSERT_FALSE (operand_equal_p (lhs6, rhs6, 0)); > + > + /* (unsigned x) + 4 == 4 + x. */ > + tree lhs7 = fold_build2 (PLUS_EXPR, stype, four, x); > + tree rhs7 = fold_build2 (PLUS_EXPR, utype, > + fold_convert (utype, x), > + fold_convert (utype, four)); > + ASSERT_TRUE (operand_equal_p (lhs7, rhs7, OEP_STRUCTURAL_EQ)); > + ASSERT_FALSE (operand_equal_p (lhs7, rhs7, 0)); > + > + /* ((unsigned x) + 4) * (unsigned y)) + z == ((4 + x) * y) + z. */ > + tree lhs8 = fold_build2 (PLUS_EXPR, stype, > + fold_build2 (MULT_EXPR, stype, > + fold_build2 (PLUS_EXPR, stype, four, > x), > + y), > + z); > + tree rhs8 = fold_build2 (MULT_EXPR, utype, > + fold_build2 (PLUS_EXPR, utype, > + fold_convert (utype, x), > + fold_convert (utype, four)), > + fold_convert (utype, y)); > + rhs8 = fold_build2 (PLUS_EXPR, stype, fold_convert (stype, rhs8), z); > + ASSERT_TRUE (operand_equal_p (lhs8, rhs8, OEP_STRUCTURAL_EQ)); > + ASSERT_FALSE (operand_equal_p (lhs8, rhs8, 0)); > +} > +} > + > namespace test_fold_vec_perm_cst { > > /* Build a VECTOR_CST corresponding to VMODE, and has > @@ -18150,6 +18267,7 @@ fold_const_cc_tests () > test_vector_folding (); > test_vec_duplicate_folding (); > test_fold_vec_perm_cst::test (); > + test_operand_equality::test (); > } > > } // namespace selftest > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_2.f90 > b/gcc/testsuite/gfortran.dg/addressing-modes_2.f90 > new file mode 100644 > index > 0000000000000000000000000000000000000000..8eee4be3dc4d69fecfacd4c2e24a4973c8539fae > --- /dev/null > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_2.f90 > @@ -0,0 +1,20 @@ > +! { dg-do compile { target aarch64-*-* } } > +! { dg-additional-options "-w -Ofast -fdump-tree-ivopts-all" } > + > +module a > +integer, parameter :: b=3, c=b > +contains > +subroutine d(block) > +integer f, col , block(:, :, :), e > +do f = 1, c > + do col = 1, c > + block(:f, :, e()) = do > + end do > + end do > + end > + end > + > +! { dg-final { scan-tree-dump-not {Selected IV set for loop .+ niters, 3 > IVs:} ivopts } } > +! { dg-final { scan-tree-dump-times {Selected IV set for loop .+ niters, 2 > IVs:} 1 ivopts } } > +! { dg-final { scan-tree-dump-times {Selected IV set for loop .+ niters, 1 > IVs:} 1 ivopts } } > + > diff --git a/gcc/tree-core.h b/gcc/tree-core.h > index > 27c569c77022227643151fa4a909a95f3d45bf55..fe398d99bcaa9e8fcec45e5e07d133bc393aa4a7 > 100644 > --- a/gcc/tree-core.h > +++ b/gcc/tree-core.h > @@ -962,7 +962,9 @@ enum operand_equal_flag { > /* In conjunction with OEP_LEXICOGRAPHIC considers names of declarations > of the same kind. Used to compare VLA bounds involving parameters > across redeclarations of the same function. */ > - OEP_DECL_NAME = 512 > + OEP_DECL_NAME = 512, > + /* Check for structural equivalence and ignore NOP_CONVERT casts. */ This should better document the assumptions - it treats expressions producing the same bit value even though they differ in sign the same(?), and it ignores differences in UB(?). OEP_BITWISE is a bit of an odd thing here, it seems to require type compatibility for constants in particular - something operand_equal_p usually doesn't. But while ignoring sign-conversions there it doesn't elsewhere (it presents itself as stricter than the default). I'd have called the new mode OEP_BITWISE if that were not already taken ... > + OEP_STRUCTURAL_EQ = 1024 > }; > > /* Enum and arrays used for tree allocation stats. > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc > index > c3218a3e8eedbb8d0a7f14c01eeb069cb6024c29..b05e4ac1e63b5c110707a8a3ed5e614b7ccc702f > 100644 > --- a/gcc/tree-ssa-loop-ivopts.cc > +++ b/gcc/tree-ssa-loop-ivopts.cc > @@ -1623,8 +1623,8 @@ record_group_use (struct ivopts_data *data, tree *use_p, > > /* Check if it has the same stripped base and step. */ > if (operand_equal_p (iv->base_object, use->iv->base_object, 0) > - && operand_equal_p (iv->step, use->iv->step, 0) > - && operand_equal_p (addr_base, use->addr_base, 0)) > + && operand_equal_p (iv->step, use->iv->step, OEP_STRUCTURAL_EQ) > + && operand_equal_p (addr_base, use->addr_base, > OEP_STRUCTURAL_EQ)) > break; > } > if (i == data->vgroups.length ()) > > > > > --