-----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Thursday, July 11, 2024 12:39 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; bin.ch...@linux.alibaba.com
> Subject: RE: [PATCH][ivopts]: perform affine fold on unsigned addressing
> modes known not to overflow. [PR114932]
>
> On Wed, 10 Jul 2024, Tamar Christina wrote:
>
> > > -----Original Message-----
> > > From: Richard Biener <rguent...@suse.de>
> > > Sent: Thursday, June 20, 2024 8:55 AM
> > > To: Tamar Christina <tamar.christ...@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; bin.ch...@linux.alibaba.com
> > > Subject: RE: [PATCH][ivopts]: perform affine fold on unsigned addressing
> > > modes known not to overflow. [PR114932]
> > >
> > > On Wed, 19 Jun 2024, Tamar Christina wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <rguent...@suse.de>
> > > > > Sent: Wednesday, June 19, 2024 1:14 PM
> > > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; bin.ch...@linux.alibaba.com
> > > > > Subject: Re: [PATCH][ivopts]: perform affine fold on unsigned addressing
> > > > > modes known not to overflow. [PR114932]
> > > > >
> > > > > On Fri, 14 Jun 2024, Tamar Christina wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > When the patch for PR114074 was applied we saw a good boost in exchange2.
> > > > > >
> > > > > > This boost was partially caused by a simplification of the addressing
> > > > > > modes.  With the patch applied IVopts saw the following form for the
> > > > > > base addressing:
> > > > > >
> > > > > >   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D)
> > > > > >   * 324) + 36)
> > > > > >
> > > > > > vs what we normally get:
> > > > > >
> > > > > >   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> > > > > >   * 81) + 9) * 4
> > > > > >
> > > > > > This is because the patch promoted multiplies where one operand is a
> > > > > > constant from a signed multiply to an unsigned one, to attempt to fold
> > > > > > away the constant.
> > > > > >
> > > > > > This patch attempts the same, but due to the various problems with SCEV
> > > > > > and niters not being able to analyze the resulting forms (i.e. PR114322)
> > > > > > we can't do it during SCEV or in a general form in fold-const the way
> > > > > > extract_muldiv attempts.
> > > > > >
> > > > > > Instead this applies the simplification during IVopts initialization,
> > > > > > when we create the IV.  Essentially, when we know the IV won't overflow
> > > > > > with regards to niters, we perform an affine fold, which simplifies the
> > > > > > internal computation even if it is signed, because we know that for
> > > > > > IVopts' uses the IV won't ever overflow.  This allows IVopts to see the
> > > > > > simplified form without influencing the rest of the compiler.
> > > > > >
> > > > > > As mentioned in PR114074 it would be good to fix the missed optimization
> > > > > > in the other passes so we can perform this in general.
> > > > > >
> > > > > > The reason this has a big impact on Fortran code is that Fortran doesn't
> > > > > > seem to have unsigned integer types.  As such all its addressing is
> > > > > > created with signed types and folding does not happen on them due to the
> > > > > > possible overflow.
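> > > > > >
> > > > > > To illustrate the fold itself (a reduced C sketch with made-up
> > > > > > variables, not taken from the benchmark): signed overflow is
> > > > > > undefined, so the outer scaling cannot be distributed into a signed
> > > > > > multiply without an overflow guarantee, while the unsigned form
> > > > > > combines freely:
> > > > > >
> > > > > >   char *base;  /* assumed pointer, stands in for &block  */
> > > > > >   long i;      /* assumed signed index, stands in for l0_19  */
> > > > > >
> > > > > >   /* Signed index: ((long) i * 81 + 9) * 4 cannot be rewritten to
> > > > > >      (long) i * 324 + 36 unless we can prove no signed overflow.  */
> > > > > >   char *p = base + ((long) i * 81 + 9) * 4;
> > > > > >
> > > > > >   /* Unsigned index: wrapping is well defined, so fold-const can
> > > > > >      combine the constants into a single multiply and offset.  */
> > > > > >   char *q = base + ((unsigned long) i * 324 + 36);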
> > > > > >
> > > > > > Concretely, on AArch64 this changes the generated code from:
> > > > > >
> > > > > >         mov     x27, -108
> > > > > >         mov     x24, -72
> > > > > >         mov     x23, -36
> > > > > >         add     x21, x1, x0, lsl 2
> > > > > >         add     x19, x20, x22
> > > > > > .L5:
> > > > > >         add     x0, x22, x19
> > > > > >         add     x19, x19, 324
> > > > > >         ldr     d1, [x0, x27]
> > > > > >         add     v1.2s, v1.2s, v15.2s
> > > > > >         str     d1, [x20, 216]
> > > > > >         ldr     d0, [x0, x24]
> > > > > >         add     v0.2s, v0.2s, v15.2s
> > > > > >         str     d0, [x20, 252]
> > > > > >         ldr     d31, [x0, x23]
> > > > > >         add     v31.2s, v31.2s, v15.2s
> > > > > >         str     d31, [x20, 288]
> > > > > >         bl      digits_20_
> > > > > >         cmp     x21, x19
> > > > > >         bne     .L5
> > > > > >
> > > > > > into:
> > > > > >
> > > > > > .L5:
> > > > > >         ldr     d1, [x19, -108]
> > > > > >         add     v1.2s, v1.2s, v15.2s
> > > > > >         str     d1, [x20, 216]
> > > > > >         ldr     d0, [x19, -72]
> > > > > >         add     v0.2s, v0.2s, v15.2s
> > > > > >         str     d0, [x20, 252]
> > > > > >         ldr     d31, [x19, -36]
> > > > > >         add     x19, x19, 324
> > > > > >         add     v31.2s, v31.2s, v15.2s
> > > > > >         str     d31, [x20, 288]
> > > > > >         bl      digits_20_
> > > > > >         cmp     x21, x19
> > > > > >         bne     .L5
> > > > > >
> > > > > > The two patches together result in a 10% performance increase in
> > > > > > exchange2 in SPECCPU 2017, a 4% reduction in binary size, and a 5%
> > > > > > improvement in compile time.  There's also a 5% performance
> > > > > > improvement in fotonik3d and a similar reduction in binary size.
> > > > > >
> > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > > > >
> > > > > > Ok for master?
> > > > > >
> > > > > > Thanks,
> > > > > > Tamar
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > > 	PR tree-optimization/114932
> > > > > > 	* tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > > 	PR tree-optimization/114932
> > > > > > 	* gfortran.dg/addressing-modes_1.f90: New test.
> > > > > >
> > > > > > ---
> > > > > > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > > > > new file mode 100644
> > > > > > index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > > > > @@ -0,0 +1,37 @@
> > > > > > +! { dg-do compile { target aarch64-*-* } }
> > > > > > +! { dg-additional-options "-w -Ofast" }
> > > > > > +
> > > > > > +  module brute_force
> > > > > > +    integer, parameter :: r=9
> > > > > > +    integer block(r, r, 0)
> > > > > > +  contains
> > > > > > +  subroutine brute
> > > > > > +   do
> > > > > > +    do
> > > > > > +     do
> > > > > > +      do
> > > > > > +       do
> > > > > > +        do
> > > > > > +         do i7 = l0, 1
> > > > > > +          select case(1 )
> > > > > > +          case(1)
> > > > > > +           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> > > > > > +          end select
> > > > > > +          do i8 = 1, 1
> > > > > > +           do i9 = 1, 1
> > > > > > +            if(1 == 1) then
> > > > > > +             call digits_20
> > > > > > +            end if
> > > > > > +           end do
> > > > > > +          end do
> > > > > > +         end do
> > > > > > +        end do
> > > > > > +       end do
> > > > > > +      end do
> > > > > > +     end do
> > > > > > +    end do
> > > > > > +   end do
> > > > > > +  end
> > > > > > +  end
> > > > > > +
> > > > > > +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> > > > > > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> > > > > > index 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e35f3d3a0d52470811 100644
> > > > > > --- a/gcc/tree-ssa-loop-ivopts.cc
> > > > > > +++ b/gcc/tree-ssa-loop-ivopts.cc
> > > > > > @@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, tree step,
> > > > > >        base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > > > > >      }
> > > > > >
> > > > > > +  /* If we know the IV won't overflow wrt niters and the type is an unsigned
> > > > > > +     type then fold using affine unsigned arithmetic to allow more folding of
> > > > > > +     constants.  */
> > > > > > +  if (no_overflow
> > > > > > +      && TYPE_UNSIGNED (TREE_TYPE (expr)))
> > > > > > +    {
> > > > > > +      aff_tree comb;
> > > > > > +      tree utype = unsigned_type_for (TREE_TYPE (expr));
> > > > > > +      tree_to_aff_combination (expr, utype, &comb);
> > > > > > +      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > > > > > +    }
> > > > > > +
> > > > >
> > > > > So right above we already do
> > > > >
> > > > >   /* Lower address expression in base except ones with DECL_P as operand.
> > > > >      By doing this:
> > > > >        1) More accurate cost can be computed for address expressions;
> > > > >        2) Duplicate candidates won't be created for bases in different
> > > > > 	  forms, like &a[0] and &a.  */
> > > > >   STRIP_NOPS (expr);
> > > > >   if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
> > > > >       || contain_complex_addr_expr (expr))
> > > > >     {
> > > > >       aff_tree comb;
> > > > >       tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
> > > > >       base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > > > >     }
> > > > >
> > > > > and if I read correctly 'expr' is
> > > > >
> > > > >   (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
> > > > >
> > > > > in your interesting case, which means it doesn't satisfy
> > > > > contain_complex_addr_expr.
> > > > >
> > > > > I don't quite get why rewriting the base into (T)(unsigned)... is
> > > > > only valid when no_overflow - no_overflow is about {base, +, step},
> > > > > not about any overflow contained in 'base'.
> > > >
> > > > Right, so you're saying overflow only applies to step and its final
> > > > evolution wrt niters?  Ok, makes sense.  I had thought it meant that the
> > > > final possible address doesn't overflow, i.e. offset + step * niters.
> > >
> > > It refers to the IV itself not wrapping, but 'base' is the initial
> > > value computation; whether that computation itself wraps or not
> > > (it would wrap the same for each iteration) isn't covered.
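> > >
> > > To make the distinction concrete (an illustrative sketch, not from the
> > > testcase): for
> > >
> > >   for (i = 0; i < n; i++)
> > >     use (p + (long) k * 81 + i);
> > >
> > > the IV is {p + (long) k * 81, +, 1}, and no_overflow only says the
> > > sequence base, base + 1, ..., base + (n - 1) does not wrap; whether the
> > > initial computation (long) k * 81 itself overflows is a separate
> > > question.
> > >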
> > > > > I wonder if we maybe want to record an "original" iv->base
> > > > > to be used for code generation (and there only the unexpanded form)
> > > > > and a variant used for the various sorts of canonicalization/compare
> > > > > (I see we eventually add/subtract step and then compare against
> > > > > sth else).  And then apply this normalization always to the not
> > > > > "original" form.
> > > >
> > > > That's certainly possible, but it'll make the code a bit harder to
> > > > follow in the cases where the comparisons are done between the split
> > > > address base + step and then a new IV built up from it.
> > > >
> > > > In those cases we'd have to split both expressions but use different
> > > > ones in the right places.  Where it gets messy is that sometimes it
> > > > updates the value and then compares.
> > > >
> > > > So if we can avoid it, it's likely cleaner.
> > >
> > > Sure.
> > >
> > > > > The above STRIP_NOPS (expr) + expand might turn an unsigned
> > > > > affine combination into a signed one which might be problematic.
> > > > > So what happens if you change the above to simply always
> > > > > unsigned expand?
> > > >
> > > > Yes, that's what's currently happening.  It's stripping (signed) (a + b)
> > > > into a + b.
> > > >
> > > > To be clear, you mean change the affine fold I added to always be
> > > > unsigned, and thus remove the contain_complex_addr_expr chunk?
> > >
> > > Yes.
> > >
> > > > That works and was my first patch locally (though I need to fully
> > > > reg-test it).
> > >
> > > I'm curious what happens.
> >
> > Ok.  We cannot fold unconditionally to unsigned; however, this I believe
> > is correct:
> >
> > The patch folds every IV to unsigned to canonicalize them.  At the end of
> > the pass match.pd will then remove unneeded conversions.
> >
> > Note that we cannot force everything to unsigned: IVopts requires that
> > array address expressions remain as such.  Folding them results in them
> > becoming pointer expressions for which some optimizations in IVopts do
> > not run.
>
> Isn't that what contain_complex_addr_expr tries to assert?  That there
> isn't a DECL in the expression?

Yes, but in this case, specifically for POINTER_PLUS_EXPR, it's perfectly
fine for the base to be a DECL as we won't be folding it.  Though looking at
this, perhaps the same condition should be applied to PLUS_EXPR and
MINUS_EXPR.  So I think contain_complex_addr_expr should maybe not check
operand0.
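
For example (a hypothetical reduced base expression):

  POINTER_PLUS_EXPR <&a, (sizetype) i_1 * 4 + 16>

Here operand 0 is the DECL address &a, but we only expand and fold operand 1
(the offset), so rejecting the whole expression because of the DECL in
operand 0 shouldn't be needed for POINTER_PLUS_EXPR.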

> > If we fold array expressions it starts treating them as normal integers
> > and leads to incorrect code (as shown by
> > testsuite/gcc.dg/torture/bitint-49.c).
> >
> > And if we fold POINTER_PLUS_EXPR then this leads to inefficient codegen,
> > as is shown by testsuite/gcc.target/i386/pr115462.c.  So for
> > POINTER_PLUS_EXPR we can fold the offset but not the base.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > x86_64-pc-linux-gnu -m32, -m64 and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	PR tree-optimization/114932
> > 	* tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	PR tree-optimization/114932
> > 	* gcc.dg/tree-ssa/pr64705.c: Update dump file scan.
> > 	* gfortran.dg/addressing-modes_1.f90: New test.
> >
> > -- inline copy of patch --
> >
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
> > index fd24e38a53e9f3a4659dece5af4d71fbc0ce2c18..3c9c2e5deed1ba755d1fae15a6553d68b6ad0098 100644
> > --- a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
> > @@ -23,4 +23,4 @@ int foo(char *flags, long len, long i, long steps)
> >
> >  /* Don't expand iv {base+step, step}_loop into {base+x+y, step}_loop
> >     even if "step == x + y".  */
> > -/* { dg-final { scan-tree-dump "Base:\\tstep_\[0-9\]* \\+ iter|Base:\\titer_\[0-9\]* \\+ step" "ivopts"} } */
> > +/* { dg-final { scan-tree-dump {Base:\t\(unsigned ?.*\) step_[0-9]* \+ \(unsigned ?.*\) iter|Base:\t\(unsigned ?.*\) iter_[0-9]* \+ \(unsigned ?.*\) step} "ivopts"} } */
> > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
> > --- /dev/null
> > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > @@ -0,0 +1,37 @@
> > +! { dg-do compile { target aarch64-*-* } }
> > +! { dg-additional-options "-w -Ofast" }
> > +
> > +  module brute_force
> > +    integer, parameter :: r=9
> > +    integer block(r, r, 0)
> > +  contains
> > +  subroutine brute
> > +   do
> > +    do
> > +     do
> > +      do
> > +       do
> > +        do
> > +         do i7 = l0, 1
> > +          select case(1 )
> > +          case(1)
> > +           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> > +          end select
> > +          do i8 = 1, 1
> > +           do i9 = 1, 1
> > +            if(1 == 1) then
> > +             call digits_20
> > +            end if
> > +           end do
> > +          end do
> > +         end do
> > +        end do
> > +       end do
> > +      end do
> > +     end do
> > +    end do
> > +   end do
> > +  end
> > +  end
> > +
> > +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> > index b05e4ac1e63b5c110707a8a3ed5e614b7ccc702f..5fc188ae3f8207943a9503e2dc51b067c3f85739 100644
> > --- a/gcc/tree-ssa-loop-ivopts.cc
> > +++ b/gcc/tree-ssa-loop-ivopts.cc
> > @@ -1184,23 +1184,41 @@ static struct iv *
> >  alloc_iv (struct ivopts_data *data, tree base, tree step,
> >  	  bool no_overflow = false)
> >  {
> > -  tree expr = base;
> >    struct iv *iv = (struct iv*) obstack_alloc (&data->iv_obstack,
> >  					      sizeof (struct iv));
> >    gcc_assert (step != NULL_TREE);
> >
> > -  /* Lower address expression in base except ones with DECL_P as operand.
> > -     By doing this:
> > +  /* Lower address expression in base by folding if possible while maintaining
> > +     the original type of the expression.  Any ADDR_EXPR should not be folded as
> > +     we will end up folding this into an integer and losing the fact it's an
> > +     address.
> > +     By doing this folding we gain:
> >       1) More accurate cost can be computed for address expressions;
> >       2) Duplicate candidates won't be created for bases in different
> >  	forms, like &a[0] and &a.  */
> > +  tree expr = base;
> >    STRIP_NOPS (expr);
> > -  if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
> > -      || contain_complex_addr_expr (expr))
> > +  if (TREE_CODE (expr) != ADDR_EXPR)
>
> So this runs the conversion for some additional cases but not for some
> cases it ran before.

Yeah, those cases should probably be folded as they were before.  So I think
this should be kept as an else.

> IIRC the original purpose is to "expand" &a[i] to &a + CST * i and
> the guard is to avoid churn on trees otherwise.
>
> Your goal is to convert '(unsigned)i' to 'i'.

Don't you mean the opposite?  Fold 'i' into '(unsigned)i'?

> But definitely TREE_CODE (expr) != ADDR_EXPR is odd since I don't
> see the difference between &x and &x + 1.

The issue isn't this, but that folding &a[i] into (unsigned)&a + i leads to
incorrect code.  The original fold never changed the sign.

> >    {
> >      aff_tree comb;
> > -    tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
> > -    base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > +    tree addr = NULL_TREE;
> > +    /* Contrary to the comment above, IVopts does not deal well with folding
> > +       address the pointers.
>
> "folding address the pointers"?
>
> > +       We avoid folding ADDR_EXPR but when we have a
> > +       POINTER_PLUS_EXPR we want to keep the base the same but still fold the
> > +       offset.  */
>
> But oddly we did this folding before?  Btw, your patch removes the
> only use of contain_complex_addr_expr but not the function itself.

No, the original fold preserves the sign of the expression, which is fine;
this now always folds to unsigned.  So I think we want that if we can't fold
to unsigned, we should still fold using the original sign to canonicalize
these address expressions?  That should cover both cases.

> So you're talking about the case where 'base' has a pointer-to-integer
> or integer-to-pointer conversion which STRIP_NOPS strips?
>
> > +    if (TREE_CODE (expr) == POINTER_PLUS_EXPR)
> > +      {
> > +	addr = TREE_OPERAND (expr, 0);
> > +	expr = TREE_OPERAND (expr, 1);
>
> but expr should be unsigned already, see expr_to_aff_combination

So I definitely see cases where expr is unsigned but has not been folded as
unsigned, i.e. the simplification of the expression as unsigned was never
performed, only a cast added.  I don't see any other conversion to unsigned
happening in the affine transform; I only see calls to fold_convert.

> > +      }
> > +
> > +    tree utype = unsigned_type_for (TREE_TYPE (expr));
> > +    tree_to_aff_combination (expr, utype, &comb);
>
> I don't think tree_to_aff_combination is supposed to "convert" on the
> fly.  There is aff_combination_convert for this.

Hmm, I thought the actual conversion happens in aff_combination_to_tree, with
tree_to_aff_combination just setting the desired type.  It looks like
aff_combination_convert does the same, except it round-trips it back through
an aff_tree, but I don't need the round-tripping.  So I can use it, but it
seems more expensive...
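
For reference, the aff_combination_convert variant would look something like
this (an untested sketch):

  aff_tree comb;
  /* Build the affine combination in the original (possibly signed) type.  */
  tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
  /* Then convert the whole combination to the unsigned type before
     rebuilding the tree.  */
  aff_combination_convert (&comb, unsigned_type_for (TREE_TYPE (expr)));
  expr = aff_combination_to_tree (&comb);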

Thanks for the review,
Tamar

> > +    expr = aff_combination_to_tree (&comb);
> > +    if (addr)
> > +      expr = fold_build_pointer_plus (addr, expr);
> > +
> > +    base = fold_convert (TREE_TYPE (base), expr);
> >    }
> >
> >    iv->base = base;
> >
>
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)