On Mon, 24 Jun 2024, Tamar Christina wrote: > > > > -----Original Message----- > > From: Richard Biener <rguent...@suse.de> > > Sent: Thursday, June 20, 2024 8:49 AM > > To: Tamar Christina <tamar.christ...@arm.com> > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; bin.ch...@linux.alibaba.com > > Subject: RE: [PATCH][ivopts]: use affine_tree when comparing IVs during > > candidate > > selection [PR114932] > > > > On Wed, 19 Jun 2024, Tamar Christina wrote: > > > > > > -----Original Message----- > > > > From: Richard Biener <rguent...@suse.de> > > > > Sent: Wednesday, June 19, 2024 12:55 PM > > > > To: Tamar Christina <tamar.christ...@arm.com> > > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; > > bin.ch...@linux.alibaba.com > > > > Subject: Re: [PATCH][ivopts]: use affine_tree when comparing IVs during > > candidate > > > > selection [PR114932] > > > > > > > > On Fri, 14 Jun 2024, Tamar Christina wrote: > > > > > > > > > Hi All, > > > > > > > > > > IVOPTS normally uses affine trees to perform comparisons between > > > > > different > > IVs, > > > > > but these seem to have been missing in two key spots and instead > > > > > normal > > tree > > > > > equivalencies used. > > > > > > > > > > In some cases where we have a structural equivalence but not a > > > > > signedness > > > > > equivalencies we end up generating both a signed and unsigned IV for > > > > > the > > same > > > > > candidate. > > > > > > > > > > This happens quite a lot with fortran but can also happen in C > > > > > because this > > came > > > > > code is unable to figure out when one expression is a multiple of > > > > > another. > > > > > > > > > > As an example in the attached testcase we get: > > > > > > > > > > Initial set of candidates: > > > > > cost: 24 (complexity 3) > > > > > reg_cost: 9 > > > > > cand_cost: 15 > > > > > cand_group_cost: 0 (complexity 3) > > > > > candidates: 1, 6, 8 > > > > > group:0 --> iv_cand:6, cost=(0,1) > > > > > group:1 --> iv_cand:1, cost=(0,0) > > > > > group:2 --> iv_cand:8, cost=(0,1) > > > > > group:3 --> iv_cand:8, cost=(0,1) > > > > > invariant variables: 6 > > > > > invariant expressions: 1, 2 > > > > > > > > > > <Invariant Expressions>: > > > > > inv_expr 1: stride.3_27 * 4 > > > > > inv_expr 2: (unsigned long) stride.3_27 * 4 > > > > > > > > > > These end up being used in the same group: > > > > > > > > > > Group 1: > > > > > cand cost compl. inv.expr. inv.vars > > > > > 1 0 0 NIL; 6 > > > > > 2 0 0 NIL; 6 > > > > > 3 0 0 NIL; 6 > > > > > > > > > > which ends up with IV opts picking the signed and unsigned IVs: > > > > > > > > > > Improved to: > > > > > cost: 24 (complexity 3) > > > > > reg_cost: 9 > > > > > cand_cost: 15 > > > > > cand_group_cost: 0 (complexity 3) > > > > > candidates: 1, 6, 8 > > > > > group:0 --> iv_cand:6, cost=(0,1) > > > > > group:1 --> iv_cand:1, cost=(0,0) > > > > > group:2 --> iv_cand:8, cost=(0,1) > > > > > group:3 --> iv_cand:8, cost=(0,1) > > > > > invariant variables: 6 > > > > > invariant expressions: 1, 2 > > > > > > > > > > and so generates the same IV as both signed and unsigned: > > > > > > > > > > ;; basic block 21, loop depth 3, count 214748368 (estimated > > > > > locally, freq > > > > 58.2545), maybe hot > > > > > ;; prev block 28, next block 31, flags: (NEW, REACHABLE, VISITED) > > > > > ;; pred: 28 [always] count:23622320 (estimated locally, > > > > > freq 6.4080) > > > > (FALLTHRU,EXECUTABLE) > > > > > ;; 25 [always] count:191126046 (estimated locally, > > > > > freq 51.8465) > > > > (FALLTHRU,DFS_BACK,EXECUTABLE) > > > > > # .MEM_66 = PHI <.MEM_34(28), .MEM_22(25)> > > > > > # ivtmp.22_41 = PHI <0(28), ivtmp.22_82(25)> > > > > > # ivtmp.26_51 = PHI <ivtmp.26_55(28), ivtmp.26_72(25)> > > > > > # ivtmp.28_90 = PHI <ivtmp.28_99(28), ivtmp.28_98(25)> > > > > > > > > > > ... > > > > > > > > > > ;; basic block 24, loop depth 3, count 214748366 (estimated > > > > > locally, freq > > > > 58.2545), maybe hot > > > > > ;; prev block 22, next block 25, flags: (NEW, REACHABLE, VISITED)' > > > > > ;; pred: 22 [always] count:95443719 (estimated locally, > > > > > freq 25.8909) > > > > (FALLTHRU) > > > > ;; 21 [33.3% (guessed)] count:71582790 (estimated > > > > locally, freq > > 19.4182) > > > > (TRUE_VALUE,EXECUTABLE) > > > > ;; 31 [33.3% (guessed)] count:47721860 (estimated > > > > locally, freq > > 12.9455) > > > > (TRUE_VALUE,EXECUTABLE) > > > > # .MEM_22 = PHI <.MEM_44(22), .MEM_31(21), .MEM_79(31)> > > > > > > > > > > > > ivtmp.22_82 = ivtmp.22_41 + 1; > > > > ivtmp.26_72 = ivtmp.26_51 + _80; > > > > ivtmp.28_98 = ivtmp.28_90 + _39; > > > > > > > > > > These two IVs are always used as unsigned, so IV ops generates: > > > > > > > > > > _73 = stride.3_27 * 4; > > > > > _80 = (unsigned long) _73; > > > > > _54 = (unsigned long) stride.3_27; > > > > > _39 = _54 * 4; > > > > > > > > > > Which means that in e.g. exchange2 we generate a lot of duplicate > > > > > code. > > > > > > > > > > This is because candidate 6 and 8 are structurally equivalent but have > > different > > > > > signs. > > > > > > > > > > This patch changes it so that if you have two IVs that are affine > > > > > equivalent to > > > > > just pick one over the other. IV already has code for this, so the > > > > > patch just > > > > > uses affine trees instead of tree for the check. > > > > > > > > > > With it we get: > > > > > > > > > > <Invariant Expressions>: > > > > > inv_expr 1: stride.3_27 * 4 > > > > > > > > > > <Group-candidate Costs>: > > > > > Group 0: > > > > > cand cost compl. inv.expr. inv.vars > > > > > 5 0 2 NIL; NIL; > > > > > 6 0 3 NIL; NIL; > > > > > > > > > > Group 1: > > > > > cand cost compl. inv.expr. inv.vars > > > > > 1 0 0 NIL; 6 > > > > > 2 0 0 NIL; 6 > > > > > 3 0 0 NIL; 6 > > > > > 4 0 0 NIL; 6 > > > > > > > > > > Initial set of candidates: > > > > > cost: 16 (complexity 3) > > > > > reg_cost: 6 > > > > > cand_cost: 10 > > > > > cand_group_cost: 0 (complexity 3) > > > > > candidates: 1, 6 > > > > > group:0 --> iv_cand:6, cost=(0,3) > > > > > group:1 --> iv_cand:1, cost=(0,0) > > > > > invariant variables: 6 > > > > > invariant expressions: 1 > > > > > > > > > > The two patches together results in a 10% performance increase in > > > > > exchange2 > > in > > > > > SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in > > > > compile > > > > > time. There's also a 5% performance improvement in fotonik3d and > > > > > similar > > > > > reduction in binary size. > > > > > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > > > > > > > > > Ok for master? > > > > > > > > > > Thanks, > > > > > Tamar > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > PR tree-optimization/114932 > > > > > * tree-affine.cc (aff_combination_constant_multiple_p): Take > > > > > zero > > > > > offsets into account. > > > > > * tree-ssa-loop-ivopts.cc (affine_compare_eq): New. > > > > > (record_group_use): Use it. > > > > > (constant_multiple_of): Also check equality under > > > > > aff_combination_constant_multiple_p. > > > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > > > PR tree-optimization/114932 > > > > > * gfortran.dg/addressing-modes_2.f90: New test. > > > > > > > > > > --- > > > > > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_2.f90 > > > > b/gcc/testsuite/gfortran.dg/addressing-modes_2.f90 > > > > > new file mode 100644 > > > > > index > > > > > > 0000000000000000000000000000000000000000..8eee4be3dc4d69fecfacd4c > > > > 2e24a4973c8539fae > > > > > --- /dev/null > > > > > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_2.f90 > > > > > @@ -0,0 +1,20 @@ > > > > > +! { dg-do compile { target aarch64-*-* } } > > > > > +! { dg-additional-options "-w -Ofast -fdump-tree-ivopts-all" } > > > > > + > > > > > +module a > > > > > +integer, parameter :: b=3, c=b > > > > > +contains > > > > > +subroutine d(block) > > > > > +integer f, col , block(:, :, :), e > > > > > +do f = 1, c > > > > > + do col = 1, c > > > > > + block(:f, :, e()) = do > > > > > + end do > > > > > + end do > > > > > + end > > > > > + end > > > > > + > > > > > +! { dg-final { scan-tree-dump-not {Selected IV set for loop .+ > > > > > niters, 3 IVs:} > > ivopts > > > > } } > > > > > +! { dg-final { scan-tree-dump-times {Selected IV set for loop .+ > > > > > niters, 2 IVs:} > > 1 > > > > ivopts } } > > > > > +! { dg-final { scan-tree-dump-times {Selected IV set for loop .+ > > > > > niters, 1 IVs:} > > 1 > > > > ivopts } } > > > > > + > > > > > diff --git a/gcc/tree-affine.cc b/gcc/tree-affine.cc > > > > > index > > > > > > d6309c4390362b680f0aa97a41fac3281ade66fd..b141bf23c1bbea1001b1bb28 > > > > 6346890ddeab4096 100644 > > > > > --- a/gcc/tree-affine.cc > > > > > +++ b/gcc/tree-affine.cc > > > > > @@ -941,6 +941,13 @@ aff_combination_constant_multiple_p (aff_tree > > *val, > > > > aff_tree *div, > > > > > &mult_set, mult)) > > > > > return false; > > > > > > > > > > + /* Everything is a multiple of 0, which means we shoudn't enforce > > > > > that > > > > > + mult_set is checked, since that forced the only valid multiple > > > > > of > > > > > + val and div to be 0 whereas 1 is also possible. */ > > > > > + if (known_eq (val->offset, 0) > > > > > + && known_eq (div->offset, 0)) > > > > > + mult_set = false; > > > > > + > > > > > > > > In fact all numbers are possible? Shouldn't this be better handled > > > > in wide_int_constant_multiple_p by special-casing > > > > known_eq (div, 0) in the known_eq (val, 0) condition by simply > > > > returning 'true' without checking or setting *mult_set? > > > > > > > > The docs of wide_int_constant_multiple_p is odd: > > > > > > > > /* If VAL != CST * DIV for any constant CST, returns false. > > > > > > > > should that be 'If VAL == CST * DIV for no constant CST, returns > > > > false.'? > > > > Or s/any/all/? > > > > > > > > > > Yeah, that condition would always be false. > > > > Can you split out a patch to fix this? > > Doing now. > > > > > > > > for (i = 0; i < div->n; i++) > > > > > { > > > > > class aff_comb_elt *elt > > > > > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc > > > > > index > > > > > > 7a277aaf18a9e0a32b8ac0d23332b7cd9945ef98..b11bd62a86092ba972a6487 > > > > 64cd2facd9ddb4914 100644 > > > > > --- a/gcc/tree-ssa-loop-ivopts.cc > > > > > +++ b/gcc/tree-ssa-loop-ivopts.cc > > > > > @@ -757,6 +757,19 @@ single_dom_exit (class loop *loop) > > > > > return exit; > > > > > } > > > > > > > > > > +/* Compares the given affine tree LEFT with the tree expression > > > > > RIGHT and > > > > return > > > > > + whether they are the same under affine equality. */ > > > > > + > > > > > +static bool > > > > > +affine_compare_eq (aff_tree &left, tree right) > > > > > +{ > > > > > + aff_tree aff_right; > > > > > + tree_to_aff_combination (right, TREE_TYPE (right), &aff_right); > > > > > + aff_combination_scale (&aff_right, -1); > > > > > + aff_combination_add (&aff_right, &left); > > > > > + return aff_combination_zero_p (&aff_right); > > > > > +} > > > > > + > > > > > > > > > > /* Given a nested expression in ARG consisting of PLUS or MULT try > > > > > to see if > > one > > > > > of the arguments of each expression is a constant and that the > > > > > type of the > > > > > @@ -1673,6 +1686,9 @@ record_group_use (struct ivopts_data *data, tree > > > > *use_p, > > > > > tree addr_base = NULL; > > > > > struct iv_group *group = NULL; > > > > > poly_uint64 addr_offset = 0; > > > > > + aff_tree iv_step, iv_addr_base; > > > > > + > > > > > + tree_to_aff_combination (iv->step, TREE_TYPE (iv->step), &iv_step); > > > > > > > > > > /* Record non address type use in a new group. */ > > > > > if (address_p (type)) > > > > > @@ -1683,6 +1699,7 @@ record_group_use (struct ivopts_data *data, tree > > > > *use_p, > > > > > tree addr_toffset; > > > > > split_constant_offset (iv->base, &addr_base, &addr_toffset); > > > > > addr_offset = int_cst_value (addr_toffset); > > > > > + tree_to_aff_combination (addr_base, TREE_TYPE (addr_base), > > > > &iv_addr_base); > > > > > for (i = 0; i < data->vgroups.length (); i++) > > > > > { > > > > > struct iv_use *use; > > > > > @@ -1694,8 +1711,8 @@ record_group_use (struct ivopts_data *data, tree > > > > *use_p, > > > > > > > > > > /* Check if it has the same stripped base and step. */ > > > > > if (operand_equal_p (iv->base_object, use->iv->base_object, 0) > > > > > - && operand_equal_p (iv->step, use->iv->step, 0) > > > > > - && operand_equal_p (addr_base, use->addr_base, 0)) > > > > > + && affine_compare_eq (iv_step, use->iv->step) > > > > > + && affine_compare_eq (iv_addr_base, use->addr_base)) > > > > > > > > There's only this use of addr_base so I think the opportunity is to > > > > turn iv_use->addr_base into aff_tree (even though that's a quite big > > > > representation). > > > > > > Ah, true. Will do. > > > > > > > > > > > For the testcase, what are the two IVs we are comparing? I wonder > > > > why you need the affine compare for iv->step? > > > > > > > > > > Because step builds up the IV expressions and can also be an expression. > > > > > > In this case: > > > > > > >>> p debug (iv->step) > > > ((unsigned long) stride.3_27) * 4 > > > >>> p debug (use->iv->step) > > > (sizetype) (stride.3_27 * 4) > > > > > > Note that the original expressions were both unsigned, but the STRIP_NOPS > > > made one signed. They still wouldn't have compared equal though due to > > > the different cast locations. > > > > > > For completeness the base here is > > > > > > >>> p debug (use->addr_base) > > > (integer(kind=4) *) ((((((sizetype) prephitmp_64 - (sizetype) > > > stride.3_27) - > > (sizetype) stride.5_29) + (sizetype) _7) + (sizetype) stride.3_27) * 4 + > > (sizetype) > > block.0_26) > > > $9 = void > > > > > > >>> p debug (addr_base) > > > (integer(kind=4) *) ((((((sizetype) prephitmp_64 - (sizetype) > > > stride.3_27) - > > (sizetype) stride.5_29) + (sizetype) _7) + (sizetype) stride.3_27) * 4 + > > (sizetype) > > block.0_26) > > > $10 = void > > > > > > > > > > > > break; > > > > > } > > > > > if (i == data->vgroups.length ()) > > > > > @@ -2231,6 +2248,14 @@ constant_multiple_of (tree top, tree bot, > > widest_int > > > > *mul) > > > > > return true; > > > > > } > > > > > > > > > > + aff_tree aff_top, aff_bot; > > > > > + tree_to_aff_combination (top, TREE_TYPE (top), &aff_top); > > > > > + tree_to_aff_combination (bot, TREE_TYPE (bot), &aff_bot); > > > > > + poly_widest_int poly_mul; > > > > > + if (aff_combination_constant_multiple_p (&aff_top, &aff_bot, > > > > > &poly_mul) > > > > > + && poly_mul.is_constant (mul)) > > > > > + return true; > > > > > + > > > > > > > > So why does stripping nops not work here? > > > > > > So this is where we compare different IV expressions to determine which > > > IVs compute the same thing and thus can be in the same group. > > > > > > The STRIP_NOPS don't work because while the incoming types are the same > > > the casts are different. So: > > > > > > >>> p debug (ustep) > > > (unsigned long) stride.3_27 * 4 > > > $3 = void > > > >>> p debug (cstep) > > > (unsigned long) (stride.3_27 * 4) > > > $4 = void > > > > > > Which is of course stripped to: > > > > > > >>> p debug (top) > > > (unsigned long) stride.3_27 * 4 > > > $1 = void > > > >>> p debug (bot) > > > stride.3_27 * 4 > > > > So we're expecting constant_multiple_of to compute *mul == 1, proving > > equality? > > > > Indeed > > > The function seems to try stripping ops from TOP until it reaches > > an expression equal to BOT and that's what fails to trigger here. > > > > What I originally wondered was iff we compute the affine combinations > > why not use only aff_combination_constant_multiple_p? > > > > I think that's probably easier, The rest of the code seems to indeed be > repeating the work of aff_combination_constant_multiple_p. > > I can try replacing the whole thing with aff_combination_constant_multiple_p > and see?
Yes. > > I might also point back to the idea I threw in somewhere, adding > > OEP_VALUE (or a better name) to the set of flags accepted by > > operand_equal_p. You mentioned hashing IIRC but I don't see the patches > > touching hashing? > > > > Yes, That can indeed be done with this approach. The hashing was that before > I > was trying to prevent the "duplicate" IV expressions from being created in the > first place by modifying get_loop_invariant_expr. > > This function looks up if we have already seen a particular IV expression and > if > we have it just returns that expression. However after reading more of the > code > I realized this wasn't the right approach, as without also dealing with the > candidates > we'd end up creating IV expression that can't be handled by any candidate. > > IVops would just give up then. Reading the code it seems that > get_loop_invariant_expr > is just there to prevent blatant duplicates. i.e. it treats `(signed) a` and > `a` as the same. > > This is also why I think that everywhere else *has* to continue stripping the > expression. > > On a note from Richard S that he thought IVopts already had some code to deal > with > expressions that differ only in signs led me to take a different approach. > > The goal wasn't to remove the different sign/unsigned IV expressions, but > instead get > Then to be servable by the same candidates. i.e. we want them in the same > candidate > groups and then candidate optimization will just do its thing. > > That seemed a more natural fit to how it worked. Yeah, I agree that sounds like the better strathegy. > Do you want me to try the operand_equal_p approach? Though in this case the > issue > is we not only need to know they're equal, but also need to know the scale > factor. For this case yes, but if you'd keep the code as-is, the equal with scale factor one case would be fixed. Not a case with different scale factors though - but conversions "elsewhere" should be handled via the stripping. So it would work to simply adjust the operand_equal_p check here? > get_computation_aff_1 scales the common type IV by the scale we determined, > so I think operand_equal_p would not be very useful here. But it does look > like > constant_multiple_of can just be implemented with > aff_combination_constant_multiple_p. > > Should I try? You've had the other places where you replace operand_equal_p with affine-compute and compare. As said that has some associated cost as well as a limit on the number of elements after which it resorts back to operand_equal_p. So for strict equality tests implementing a weaker operand_equal_p might be a better solution. Richard. > Thanks, > Tamar > > > > > Both of these compute the same thing so by doing the affine compare they > > > end up in the same IV groups and can be costed. Later during candidate > > > selection > > > these are the steps we're comparing to see if the candidate is the same > > > invariant. > > > > > > The full list is: > > > > > > <IV Groups>: > > > Group 0: > > > Type: REFERENCE ADDRESS > > > Use 0.0: > > > At stmt: *_4 = _33; > > > At pos: *_4 > > > IV struct: > > > Type: integer(kind=4) * > > > Base: (integer(kind=4) *) ((((unsigned long) stride.3_27 + > > > (unsigned long) > > _36) * 4 + (unsigned long) block.0_26) + 4) > > > Step: (sizetype) (stride.3_27 * 4) > > > Object: (void *) block.0_26 > > > Biv: N > > > Overflowness wrto loop niter: Overflow > > > Use 0.1: > > > At stmt: *_78 = _33; > > > At pos: *_78 > > > IV struct: > > > Type: integer(kind=4) * > > > Base: (integer(kind=4) *) ((((unsigned long) stride.3_27 + > > > (unsigned long) > > _36) * 4 + (unsigned long) block.0_26) + 8) > > > Step: (unsigned long) stride.3_27 * 4 > > > Object: (void *) block.0_26 > > > Biv: N > > > Overflowness wrto loop niter: Overflow > > > Use 0.2: > > > At stmt: *_45 = _33; > > > At pos: *_45 > > > IV struct: > > > Type: integer(kind=4) * > > > Base: (integer(kind=4) *) ((((unsigned long) stride.3_27 + > > > (unsigned long) > > _36) * 4 + (unsigned long) block.0_26) + 12) > > > Step: (unsigned long) stride.3_27 * 4 > > > Object: (void *) block.0_26 > > > Biv: N > > > Overflowness wrto loop niter: Overflow > > > > > > And all these IVs are the same but with a slightly different base offset. > > > By doing > > the affine compare here IV ops sees that > > > The best candidate is: > > > > > > Candidate 6: > > > Depend on inv.exprs: 1 > > > Var befor: ivtmp.26_51 > > > Var after: ivtmp.26_72 > > > Incr POS: before exit test > > > IV struct: > > > Type: unsigned long > > > Base: ((unsigned long) stride.3_27 + (unsigned long) _36) * 4 + > > > (unsigned > > long) block.0_26 > > > Step: (unsigned long) (stride.3_27 * 4) > > > Object: (void *) block.0_26 > > > Biv: N > > > Overflowness wrto loop niter: Overflow > > > > > > And just builds the three IVs from the same candidate. > > > > > > Does this cover what you wanted? > > > > > > Cheers, > > > Tamar > > > > > > > > > > > > code = TREE_CODE (top); > > > > > switch (code) > > > > > { > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Richard Biener <rguent...@suse.de> > > > > SUSE Software Solutions Germany GmbH, > > > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > > Nuernberg) > > > > > > > -- > > Richard Biener <rguent...@suse.de> > > SUSE Software Solutions Germany GmbH, > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)