RE: [PATCH][ivopts]: use affine_tree when comparing IVs during candidate selection [PR114932]

Tamar Christina Wed, 19 Jun 2024 07:26:41 -0700

> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Wednesday, June 19, 2024 12:55 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; bin.ch...@linux.alibaba.com
> Subject: Re: [PATCH][ivopts]: use affine_tree when comparing IVs during 
> candidate
> selection [PR114932]
> 
> On Fri, 14 Jun 2024, Tamar Christina wrote:
> 
> > Hi All,
> >
> > IVOPTS normally uses affine trees to perform comparisons between different 
> > IVs,
> > but these seem to have been missing in two key spots and instead normal tree
> > equivalencies used.
> >
> > In some cases where we have a structural equivalence but not a signedness
> > equivalencies we end up generating both a signed and unsigned IV for the 
> > same
> > candidate.
> >
> > This happens quite a lot with fortran but can also happen in C because this 
> > came
> > code is unable to figure out when one expression is a multiple of another.
> >
> > As an example in the attached testcase we get:
> >
> > Initial set of candidates:
> >   cost: 24 (complexity 3)
> >   reg_cost: 9
> >   cand_cost: 15
> >   cand_group_cost: 0 (complexity 3)
> >   candidates: 1, 6, 8
> >    group:0 --> iv_cand:6, cost=(0,1)
> >    group:1 --> iv_cand:1, cost=(0,0)
> >    group:2 --> iv_cand:8, cost=(0,1)
> >    group:3 --> iv_cand:8, cost=(0,1)
> >   invariant variables: 6
> >   invariant expressions: 1, 2
> >
> > <Invariant Expressions>:
> > inv_expr 1:     stride.3_27 * 4
> > inv_expr 2:     (unsigned long) stride.3_27 * 4
> >
> > These end up being used in the same group:
> >
> > Group 1:
> > cand  cost    compl.  inv.expr.       inv.vars
> > 1     0       0       NIL;    6
> > 2     0       0       NIL;    6
> > 3     0       0       NIL;    6
> >
> > which ends up with IV opts picking the signed and unsigned IVs:
> >
> > Improved to:
> >   cost: 24 (complexity 3)
> >   reg_cost: 9
> >   cand_cost: 15
> >   cand_group_cost: 0 (complexity 3)
> >   candidates: 1, 6, 8
> >    group:0 --> iv_cand:6, cost=(0,1)
> >    group:1 --> iv_cand:1, cost=(0,0)
> >    group:2 --> iv_cand:8, cost=(0,1)
> >    group:3 --> iv_cand:8, cost=(0,1)
> >   invariant variables: 6
> >   invariant expressions: 1, 2
> >
> > and so generates the same IV as both signed and unsigned:
> >
> > ;;   basic block 21, loop depth 3, count 214748368 (estimated locally, freq
> 58.2545), maybe hot
> > ;;    prev block 28, next block 31, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       28 [always]  count:23622320 (estimated locally, freq 
> > 6.4080)
> (FALLTHRU,EXECUTABLE)
> > ;;                25 [always]  count:191126046 (estimated locally, freq 
> > 51.8465)
> (FALLTHRU,DFS_BACK,EXECUTABLE)
> >   # .MEM_66 = PHI <.MEM_34(28), .MEM_22(25)>
> >   # ivtmp.22_41 = PHI <0(28), ivtmp.22_82(25)>
> >   # ivtmp.26_51 = PHI <ivtmp.26_55(28), ivtmp.26_72(25)>
> >   # ivtmp.28_90 = PHI <ivtmp.28_99(28), ivtmp.28_98(25)>
> >
> > ...
> >
> > ;;   basic block 24, loop depth 3, count 214748366 (estimated locally, freq
> 58.2545), maybe hot
> > ;;    prev block 22, next block 25, flags: (NEW, REACHABLE, VISITED)'
> > ;;    pred:       22 [always]  count:95443719 (estimated locally, freq 
> > 25.8909)
> (FALLTHRU)
> ;;                21 [33.3% (guessed)]  count:71582790 (estimated locally, 
> freq 19.4182)
> (TRUE_VALUE,EXECUTABLE)
> ;;                31 [33.3% (guessed)]  count:47721860 (estimated locally, 
> freq 12.9455)
> (TRUE_VALUE,EXECUTABLE)
> # .MEM_22 = PHI <.MEM_44(22), .MEM_31(21), .MEM_79(31)>
> 
>                                                                               
>      ivtmp.22_82 = ivtmp.22_41 + 1;
> ivtmp.26_72 = ivtmp.26_51 + _80;
> ivtmp.28_98 = ivtmp.28_90 + _39;
> >
> > These two IVs are always used as unsigned, so IV ops generates:
> >
> >   _73 = stride.3_27 * 4;
> >   _80 = (unsigned long) _73;
> >   _54 = (unsigned long) stride.3_27;
> >   _39 = _54 * 4;
> >
> > Which means that in e.g. exchange2 we generate a lot of duplicate code.
> >
> > This is because candidate 6 and 8 are structurally equivalent but have 
> > different
> > signs.
> >
> > This patch changes it so that if you have two IVs that are affine 
> > equivalent to
> > just pick one over the other.  IV already has code for this, so the patch 
> > just
> > uses affine trees instead of tree for the check.
> >
> > With it we get:
> >
> > <Invariant Expressions>:
> > inv_expr 1:     stride.3_27 * 4
> >
> > <Group-candidate Costs>:
> > Group 0:
> >   cand  cost    compl.  inv.expr.       inv.vars
> >   5     0       2       NIL;    NIL;
> >   6     0       3       NIL;    NIL;
> >
> > Group 1:
> >   cand  cost    compl.  inv.expr.       inv.vars
> >   1     0       0       NIL;    6
> >   2     0       0       NIL;    6
> >   3     0       0       NIL;    6
> >   4     0       0       NIL;    6
> >
> > Initial set of candidates:
> >   cost: 16 (complexity 3)
> >   reg_cost: 6
> >   cand_cost: 10
> >   cand_group_cost: 0 (complexity 3)
> >   candidates: 1, 6
> >    group:0 --> iv_cand:6, cost=(0,3)
> >    group:1 --> iv_cand:1, cost=(0,0)
> >   invariant variables: 6
> >   invariant expressions: 1
> >
> > The two patches together results in a 10% performance increase in exchange2 
> > in
> > SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in
> compile
> > time. There's also a 5% performance improvement in fotonik3d and similar
> > reduction in binary size.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >     PR tree-optimization/114932
> >     * tree-affine.cc (aff_combination_constant_multiple_p): Take zero
> >     offsets into account.
> >     * tree-ssa-loop-ivopts.cc (affine_compare_eq): New.
> >     (record_group_use): Use it.
> >     (constant_multiple_of): Also check equality under
> >     aff_combination_constant_multiple_p.
> >
> > gcc/testsuite/ChangeLog:
> >
> >     PR tree-optimization/114932
> >     * gfortran.dg/addressing-modes_2.f90: New test.
> >
> > ---
> > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_2.f90
> b/gcc/testsuite/gfortran.dg/addressing-modes_2.f90
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..8eee4be3dc4d69fecfacd4c
> 2e24a4973c8539fae
> > --- /dev/null
> > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_2.f90
> > @@ -0,0 +1,20 @@
> > +! { dg-do compile { target aarch64-*-* } }
> > +! { dg-additional-options "-w -Ofast -fdump-tree-ivopts-all" }
> > +
> > +module a
> > +integer, parameter :: b=3, c=b
> > +contains
> > +subroutine d(block)
> > +integer f, col   , block(:, :, :), e
> > +do f = 1, c
> > +   do col = 1, c
> > +             block(:f,                          :, e()) = do
> > +     end do
> > +  end do
> > +  end
> > +  end
> > +
> > +! { dg-final { scan-tree-dump-not {Selected IV set for loop .+ niters, 3 
> > IVs:} ivopts
> } }
> > +! { dg-final { scan-tree-dump-times {Selected IV set for loop .+ niters, 2 
> > IVs:} 1
> ivopts } }
> > +! { dg-final { scan-tree-dump-times {Selected IV set for loop .+ niters, 1 
> > IVs:} 1
> ivopts } }
> > +
> > diff --git a/gcc/tree-affine.cc b/gcc/tree-affine.cc
> > index
> d6309c4390362b680f0aa97a41fac3281ade66fd..b141bf23c1bbea1001b1bb28
> 6346890ddeab4096 100644
> > --- a/gcc/tree-affine.cc
> > +++ b/gcc/tree-affine.cc
> > @@ -941,6 +941,13 @@ aff_combination_constant_multiple_p (aff_tree *val,
> aff_tree *div,
> >                                  &mult_set, mult))
> >      return false;
> >
> > +  /* Everything is a multiple of 0, which means we shoudn't enforce that
> > +     mult_set is checked, since that forced the only valid multiple of
> > +     val and div to be 0 whereas 1 is also possible.  */
> > +  if (known_eq (val->offset, 0)
> > +      && known_eq (div->offset, 0))
> > +    mult_set = false;
> > +
> 
> In fact all numbers are possible?  Shouldn't this be better handled
> in wide_int_constant_multiple_p by special-casing
> known_eq (div, 0) in the known_eq (val, 0) condition by simply
> returning 'true' without checking or setting *mult_set?
> 
> The docs of wide_int_constant_multiple_p is odd:
> 
> /* If VAL != CST * DIV for any constant CST, returns false.
> 
> should that be 'If VAL == CST * DIV for no constant CST, returns false.'?
> Or s/any/all/?
>


Yeah, that condition would always be false.

> >    for (i = 0; i < div->n; i++)
> >      {
> >        class aff_comb_elt *elt
> > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> > index
> 7a277aaf18a9e0a32b8ac0d23332b7cd9945ef98..b11bd62a86092ba972a6487
> 64cd2facd9ddb4914 100644
> > --- a/gcc/tree-ssa-loop-ivopts.cc
> > +++ b/gcc/tree-ssa-loop-ivopts.cc
> > @@ -757,6 +757,19 @@ single_dom_exit (class loop *loop)
> >    return exit;
> >  }
> >
> > +/* Compares the given affine tree LEFT with the tree expression RIGHT and
> return
> > +   whether they are the same under affine equality.  */
> > +
> > +static bool
> > +affine_compare_eq (aff_tree &left, tree right)
> > +{
> > +  aff_tree aff_right;
> > +  tree_to_aff_combination (right, TREE_TYPE (right), &aff_right);
> > +  aff_combination_scale (&aff_right, -1);
> > +  aff_combination_add (&aff_right, &left);
> > +  return aff_combination_zero_p (&aff_right);
> > +}
> > +
> >
> >  /* Given a nested expression in ARG consisting of PLUS or MULT try to see 
> > if one
> >     of the arguments of each expression is a constant and that the type of 
> > the
> > @@ -1673,6 +1686,9 @@ record_group_use (struct ivopts_data *data, tree
> *use_p,
> >    tree addr_base = NULL;
> >    struct iv_group *group = NULL;
> >    poly_uint64 addr_offset = 0;
> > +  aff_tree iv_step, iv_addr_base;
> > +
> > +  tree_to_aff_combination (iv->step, TREE_TYPE (iv->step), &iv_step);
> >
> >    /* Record non address type use in a new group.  */
> >    if (address_p (type))
> > @@ -1683,6 +1699,7 @@ record_group_use (struct ivopts_data *data, tree
> *use_p,
> >        tree addr_toffset;
> >        split_constant_offset (iv->base, &addr_base, &addr_toffset);
> >        addr_offset = int_cst_value (addr_toffset);
> > +      tree_to_aff_combination (addr_base, TREE_TYPE (addr_base),
> &iv_addr_base);
> >        for (i = 0; i < data->vgroups.length (); i++)
> >     {
> >       struct iv_use *use;
> > @@ -1694,8 +1711,8 @@ record_group_use (struct ivopts_data *data, tree
> *use_p,
> >
> >       /* Check if it has the same stripped base and step.  */
> >       if (operand_equal_p (iv->base_object, use->iv->base_object, 0)
> > -         && operand_equal_p (iv->step, use->iv->step, 0)
> > -         && operand_equal_p (addr_base, use->addr_base, 0))
> > +         && affine_compare_eq (iv_step, use->iv->step)
> > +         && affine_compare_eq (iv_addr_base, use->addr_base))
> 
> There's only this use of addr_base so I think the opportunity is to
> turn iv_use->addr_base into aff_tree (even though that's a quite big
> representation).

Ah, true.  Will do.

> 
> For the testcase, what are the two IVs we are comparing?  I wonder
> why you need the affine compare for iv->step?
> 

Because step builds up the IV expressions and can also be an expression.

In this case:

>>> p debug (iv->step)
((unsigned long) stride.3_27) * 4
>>> p debug (use->iv->step)
(sizetype) (stride.3_27 * 4)

Note that the original expressions were both unsigned, but the STRIP_NOPS
made one signed.   They still wouldn't have compared equal though due to
the different cast locations.

For completeness the base here is

>>> p debug (use->addr_base)
(integer(kind=4) *) ((((((sizetype) prephitmp_64 - (sizetype) stride.3_27) - 
(sizetype) stride.5_29) + (sizetype) _7) + (sizetype) stride.3_27) * 4 + 
(sizetype) block.0_26)
$9 = void

>>> p debug (addr_base)
(integer(kind=4) *) ((((((sizetype) prephitmp_64 - (sizetype) stride.3_27) - 
(sizetype) stride.5_29) + (sizetype) _7) + (sizetype) stride.3_27) * 4 + 
(sizetype) block.0_26)
$10 = void
>

> >         break;
> >     }
> >        if (i == data->vgroups.length ())
> > @@ -2231,6 +2248,14 @@ constant_multiple_of (tree top, tree bot, widest_int
> *mul)
> >        return true;
> >      }
> >
> > +  aff_tree aff_top, aff_bot;
> > +  tree_to_aff_combination (top, TREE_TYPE (top), &aff_top);
> > +  tree_to_aff_combination (bot, TREE_TYPE (bot), &aff_bot);
> > +  poly_widest_int poly_mul;
> > +  if (aff_combination_constant_multiple_p (&aff_top, &aff_bot, &poly_mul)
> > +      && poly_mul.is_constant (mul))
> > +    return true;
> > +
> 
> So why does stripping nops not work here?

So this is where we compare different IV expressions to determine which
IVs compute the same thing and thus can be in the same group.

The STRIP_NOPS don't work because while the incoming types are the same
the casts are different.  So:

>>> p debug (ustep)
(unsigned long) stride.3_27 * 4
$3 = void
>>> p debug (cstep)
(unsigned long) (stride.3_27 * 4)
$4 = void

Which is of course stripped to:

>>> p debug (top)
(unsigned long) stride.3_27 * 4
$1 = void
>>> p debug (bot)
stride.3_27 * 4

Both of these compute the same thing so by doing the affine compare they
end up in the same IV groups and can be costed.  Later during candidate 
selection
these are the steps we're comparing to see if the candidate is the same 
invariant.

The full list is:

<IV Groups>:
Group 0:
  Type: REFERENCE ADDRESS
  Use 0.0:
    At stmt:    *_4 = _33;
    At pos:     *_4
    IV struct:
      Type:     integer(kind=4) *
      Base:     (integer(kind=4) *) ((((unsigned long) stride.3_27 + (unsigned 
long) _36) * 4 + (unsigned long) block.0_26) + 4)
      Step:     (sizetype) (stride.3_27 * 4)
      Object:   (void *) block.0_26
      Biv:      N
      Overflowness wrto loop niter:     Overflow
  Use 0.1:
    At stmt:    *_78 = _33;
    At pos:     *_78
    IV struct:
      Type:     integer(kind=4) *
      Base:     (integer(kind=4) *) ((((unsigned long) stride.3_27 + (unsigned 
long) _36) * 4 + (unsigned long) block.0_26) + 8)
      Step:     (unsigned long) stride.3_27 * 4
      Object:   (void *) block.0_26
      Biv:      N
      Overflowness wrto loop niter:     Overflow
  Use 0.2:
    At stmt:    *_45 = _33;
    At pos:     *_45
    IV struct:
      Type:     integer(kind=4) *
      Base:     (integer(kind=4) *) ((((unsigned long) stride.3_27 + (unsigned 
long) _36) * 4 + (unsigned long) block.0_26) + 12)
      Step:     (unsigned long) stride.3_27 * 4
      Object:   (void *) block.0_26
      Biv:      N
      Overflowness wrto loop niter:     Overflow

And all these IVs are the same but with a slightly different base offset.  By 
doing the affine compare here IV ops sees that
The best candidate is:

Candidate 6:
  Depend on inv.exprs: 1
  Var befor: ivtmp.26_51
  Var after: ivtmp.26_72
  Incr POS: before exit test
  IV struct:
    Type:       unsigned long
    Base:       ((unsigned long) stride.3_27 + (unsigned long) _36) * 4 + 
(unsigned long) block.0_26
    Step:       (unsigned long) (stride.3_27 * 4)
    Object:     (void *) block.0_26
    Biv:        N
    Overflowness wrto loop niter:       Overflow

And just builds the three IVs from the same candidate.

Does this cover what you wanted?

Cheers,
Tamar

> 
> >    code = TREE_CODE (top);
> >    switch (code)
> >      {
> >
> >
> >
> >
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

RE: [PATCH][ivopts]: use affine_tree when comparing IVs during candidate selection [PR114932]

Reply via email to