On Wed, 19 Jun 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Wednesday, June 19, 2024 1:14 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; bin.ch...@linux.alibaba.com
> > Subject: Re: [PATCH][ivopts]: perform affine fold on unsigned addressing modes known not to overflow. [PR114932]
> > 
> > On Fri, 14 Jun 2024, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > When the patch for PR114074 was applied we saw a good boost in exchange2.
> > >
> > > This boost was partially caused by a simplification of the addressing
> > > modes.  With the patch applied, IVopts saw the following form for the base
> > > addressing:
> > >
> > >   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)
> > >
> > > vs. what we normally get:
> > >
> > >   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
> > >
> > > This is because the patch promoted multiplies where one operand is a
> > > constant from a signed multiply to an unsigned one, to attempt to fold
> > > away the constant.
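> > >
> > > In the unsigned (wrapping) type the constant multiply can be distributed
> > > and combined:
> > >
> > >   ((unsigned long) l0_19(D) * 81 + 9) * 4  ==>  (unsigned long) l0_19(D) * 324 + 36
> > >
> > > whereas the signed form has to be left alone because the intermediate
> > > signed multiply might overflow.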
> > >
> > > This patch attempts the same, but due to the various problems with SCEV and
> > > niters not being able to analyze the resulting forms (i.e. PR114322) we
> > > can't do it during SCEV, nor in a general form in fold-const the way
> > > extract_muldiv attempts.
> > >
> > > Instead, this applies the simplification during IVopts initialization when
> > > we create the IV.  Essentially, when we know the IV won't overflow with
> > > regard to niters, we perform an affine fold which simplifies the internal
> > > computation even if it is signed, because we know that, for the uses IVopts
> > > creates, the IV won't ever overflow.  This allows IVopts to see the
> > > simplified form without influencing the rest of the compiler.
> > >
> > > As mentioned in PR114074, it would be good to fix the missed optimization
> > > in the other passes so we can perform this in general.
> > >
> > > The reason this has a big impact on Fortran code is that Fortran doesn't
> > > have unsigned integer types.  As such, all of its address computations are
> > > created with signed types, and folding does not happen on them due to the
> > > possible overflow.
> > >
> > > Concretely, on AArch64 this changes the generated code from:
> > >
> > >         mov     x27, -108
> > >         mov     x24, -72
> > >         mov     x23, -36
> > >         add     x21, x1, x0, lsl 2
> > >         add     x19, x20, x22
> > > .L5:
> > >         add     x0, x22, x19
> > >         add     x19, x19, 324
> > >         ldr     d1, [x0, x27]
> > >         add     v1.2s, v1.2s, v15.2s
> > >         str     d1, [x20, 216]
> > >         ldr     d0, [x0, x24]
> > >         add     v0.2s, v0.2s, v15.2s
> > >         str     d0, [x20, 252]
> > >         ldr     d31, [x0, x23]
> > >         add     v31.2s, v31.2s, v15.2s
> > >         str     d31, [x20, 288]
> > >         bl      digits_20_
> > >         cmp     x21, x19
> > >         bne     .L5
> > >
> > > into:
> > >
> > > .L5:
> > >         ldr     d1, [x19, -108]
> > >         add     v1.2s, v1.2s, v15.2s
> > >         str     d1, [x20, 216]
> > >         ldr     d0, [x19, -72]
> > >         add     v0.2s, v0.2s, v15.2s
> > >         str     d0, [x20, 252]
> > >         ldr     d31, [x19, -36]
> > >         add     x19, x19, 324
> > >         add     v31.2s, v31.2s, v15.2s
> > >         str     d31, [x20, 288]
> > >         bl      digits_20_
> > >         cmp     x21, x19
> > >         bne     .L5
> > >
> > > The two patches together result in a 10% performance increase in exchange2
> > > in SPEC CPU 2017, a 4% reduction in binary size, and a 5% improvement in
> > > compile time.  There's also a 5% performance improvement in fotonik3d and a
> > > similar reduction in binary size.
> > >
> > > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> > >
> > > Ok for master?
> > >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >   PR tree-optimization/114932
> > >   * tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >   PR tree-optimization/114932
> > >   * gfortran.dg/addressing-modes_1.f90: New test.
> > >
> > > ---
> > > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > @@ -0,0 +1,37 @@
> > > +! { dg-do compile { target aarch64-*-* } }
> > > +! { dg-additional-options "-w -Ofast" }
> > > +
> > > +module brute_force
> > > +  integer, parameter :: r=9
> > > +  integer block(r, r, 0)
> > > +contains
> > > +  subroutine brute
> > > +    do
> > > +      do
> > > +        do
> > > +          do
> > > +            do
> > > +              do
> > > +                do i7 = l0, 1
> > > +                  select case(1)
> > > +                  case(1)
> > > +                    block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> > > +                  end select
> > > +                  do i8 = 1, 1
> > > +                    do i9 = 1, 1
> > > +                      if (1 == 1) then
> > > +                        call digits_20
> > > +                      end if
> > > +                    end do
> > > +                  end do
> > > +                end do
> > > +              end do
> > > +            end do
> > > +          end do
> > > +        end do
> > > +      end do
> > > +    end do
> > > +  end
> > > +end
> > > +
> > > +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> > > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> > > index 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e35f3d3a0d52470811 100644
> > > --- a/gcc/tree-ssa-loop-ivopts.cc
> > > +++ b/gcc/tree-ssa-loop-ivopts.cc
> > > @@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, tree step,
> > >        base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > >      }
> > >
> > > +  /* If we know the IV won't overflow wrt niters and the type is an unsigned
> > > +     type then fold using affine unsigned arithmetic to allow more folding of
> > > +     constants.  */
> > > +  if (no_overflow
> > > +      && TYPE_UNSIGNED (TREE_TYPE (expr)))
> > > +    {
> > > +      aff_tree comb;
> > > +      tree utype = unsigned_type_for (TREE_TYPE (expr));
> > > +      tree_to_aff_combination (expr, utype, &comb);
> > > +      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > > +    }
> > > +
> > 
> > So right above we already do
> > 
> >   /* Lower address expression in base except ones with DECL_P as operand.
> >      By doing this:
> >        1) More accurate cost can be computed for address expressions;
> >        2) Duplicate candidates won't be created for bases in different
> >           forms, like &a[0] and &a.  */
> >   STRIP_NOPS (expr);
> >   if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
> >       || contain_complex_addr_expr (expr))
> >     {
> >       aff_tree comb;
> >       tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
> >       base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> >     }
> > 
> > and if I read correctly 'expr' is
> > 
> >   (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
> > 
> > in your interesting case, which means it doesn't satisfy
> > contain_complex_addr_expr.
> > 
> > I don't quite get why rewriting the base into (T)(unsigned)... is
> > only valid when no_overflow - no_overflow is about {base, +, step},
> > not about any overflow contained in 'base'.
> > 
> 
> Right, so you're saying overflow only applies to the step and its final
> evolution wrt niters?  Ok, makes sense.  I had thought it referred to the
> final possible address not overflowing, i.e. offset + step * niters.

It refers to the IV itself not wrapping, but 'base' is the initial
value computation; whether that computation itself wraps or not
(it would wrap the same for each iteration) isn't covered.
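
A made-up C++ analogue of the distinction (illustration only, not from
the testcase):

  #include <cstdio>

  int main ()
  {
    /* The base computation itself wraps (300 does not fit in 8 bits)...  */
    unsigned char base = (unsigned char) (200 + 100);  /* 300 wraps to 44.  */

    /* ...but the IV {base, +, 1} never wraps across its 10 iterations,
       which is all no_overflow is a statement about.  */
    for (unsigned char iv = base; iv != (unsigned char) (base + 10); iv++)
      std::printf ("%u\n", (unsigned) iv);
  }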

> > I wonder if we maybe want to record an "original" iv->base
> > to be used for code generation (and there only the unexpanded form)
> > and a variant used for the various sorts of canonicalization/compare
> > (I see we eventually add/subtract step and then compare against
> > sth else).  And then always apply this normalization to the
> > non-"original" form.
> 
> That's certainly possible, but it'll make the code a bit harder to follow
> in the cases where the comparisons are done between the split address
> base + step and then a new IV built up from it.
> 
> In those cases we'd have to split both expressions but use different ones
> in the right places.  Where it gets messy is that sometimes it updates the
> value and then compares.
> 
> So if we can avoid it, it's likely cleaner.

Sure.

> > 
> > The above STRIP_NOPS (expr) + expand might turn an unsigned
> > affine combination into a signed one, which might be problematic.
> > So what happens if you change the above to simply always
> > expand unsigned?
> > 
> 
> Yes, that's what's currently happening: it's stripping (signed) (a + b)
> into a + b.
> 
> To be clear, you mean change the affine fold I added to always be
> unsigned, and thus remove the contain_complex_addr_expr chunk?

Yes.
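
Roughly, an untested sketch of what I mean (reusing the existing locals;
IIRC unsigned_type_for handles pointer types as well):

  STRIP_NOPS (expr);
  aff_tree comb;
  /* Always expand the base in the corresponding unsigned type,
     subsuming the contain_complex_addr_expr special case.  */
  tree utype = unsigned_type_for (TREE_TYPE (expr));
  tree_to_aff_combination (expr, utype, &comb);
  base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));

i.e. drop the contain_complex_addr_expr check and always expand in the
unsigned type.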

> That works and was my first patch locally (though I need to fully reg-test it).

I'm curious what happens.

Btw, I think we want to go very piecemeal with these changes given
there's nobody who's really familiar with the code.

Richard.
