> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Thursday, June 20, 2024 8:55 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; bin.ch...@linux.alibaba.com
> Subject: RE: [PATCH][ivopts]: perform affine fold on unsigned addressing modes
> known not to overflow. [PR114932]
> 
> On Wed, 19 Jun 2024, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguent...@suse.de>
> > > Sent: Wednesday, June 19, 2024 1:14 PM
> > > To: Tamar Christina <tamar.christ...@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>;
> bin.ch...@linux.alibaba.com
> > > Subject: Re: [PATCH][ivopts]: perform affine fold on unsigned addressing
> modes
> > > known not to overflow. [PR114932]
> > >
> > > On Fri, 14 Jun 2024, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > When the patch for PR114074 was applied we saw a good boost in exchange2.
> > > >
> > > > This boost was partially caused by a simplification of the addressing 
> > > > modes.
> > > > With the patch applied IV opts saw the following form for the base addressing;
> > > >
> > > >   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)
> > > >
> > > > vs what we normally get:
> > > >
> > > >   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
> > > >
> > > > This is because the patch promoted multiplies where one operand is a constant
> > > > from a signed multiply to an unsigned one, to attempt to fold away the constant.
> > > >
> > > > This patch attempts the same but due to the various problems with SCEV 
> > > > and
> > > > niters not being able to analyze the resulting forms (i.e. PR114322) we 
> > > > can't
> > > > do it during SCEV or in the general form like in fold-const like 
> > > > extract_muldiv
> > > > attempts.
> > > >
> > > > Instead this applies the simplification during IVopts initialization 
> > > > when we
> > > > create the IV.  Essentially when we know the IV won't overflow with regards to
> > > > niters then we perform an affine fold which gets it to simplify the 
> > > > internal
> > > > computation, even if this is signed because we know that for IVOPTs 
> > > > uses the
> > > > IV won't ever overflow.  This allows IV opts to see the simplified form
> > > > without influencing the rest of the compiler.
> > > >
> > > > as mentioned in PR114074 it would be good to fix the missed 
> > > > optimization in the
> > > > other passes so we can perform this in general.
> > > >
> > > > The reason this has a big impact on fortran code is that fortran 
> > > > doesn't seem to
> > > > have unsigned integer types.  As such all it's addressing are created 
> > > > with
> > > > signed types and folding does not happen on them due to the possible overflow.
> > > >
> > > > concretely on AArch64 this changes the results from generation:
> > > >
> > > >         mov     x27, -108
> > > >         mov     x24, -72
> > > >         mov     x23, -36
> > > >         add     x21, x1, x0, lsl 2
> > > >         add     x19, x20, x22
> > > > .L5:
> > > >         add     x0, x22, x19
> > > >         add     x19, x19, 324
> > > >         ldr     d1, [x0, x27]
> > > >         add     v1.2s, v1.2s, v15.2s
> > > >         str     d1, [x20, 216]
> > > >         ldr     d0, [x0, x24]
> > > >         add     v0.2s, v0.2s, v15.2s
> > > >         str     d0, [x20, 252]
> > > >         ldr     d31, [x0, x23]
> > > >         add     v31.2s, v31.2s, v15.2s
> > > >         str     d31, [x20, 288]
> > > >         bl      digits_20_
> > > >         cmp     x21, x19
> > > >         bne     .L5
> > > >
> > > > into:
> > > >
> > > > .L5:
> > > >         ldr     d1, [x19, -108]
> > > >         add     v1.2s, v1.2s, v15.2s
> > > >         str     d1, [x20, 216]
> > > >         ldr     d0, [x19, -72]
> > > >         add     v0.2s, v0.2s, v15.2s
> > > >         str     d0, [x20, 252]
> > > >         ldr     d31, [x19, -36]
> > > >         add     x19, x19, 324
> > > >         add     v31.2s, v31.2s, v15.2s
> > > >         str     d31, [x20, 288]
> > > >         bl      digits_20_
> > > >         cmp     x21, x19
> > > >         bne     .L5
> > > >
> > > > The two patches together results in a 10% performance increase in 
> > > > exchange2 in
> > > > SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in
> > > compile
> > > > time. There's also a 5% performance improvement in fotonik3d and similar
> > > > reduction in binary size.
> > > >
> > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > >
> > > > Ok for master?
> > > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         PR tree-optimization/114932
> > > >         * tree-scalar-evolution.cc (alloc_iv): Perform affine unsigned 
> > > > fold.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         PR tree-optimization/114932
> > > >         * gfortran.dg/addressing-modes_1.f90: New test.
> > > >
> > > > ---
> > > > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > > new file mode 100644
> > > > index
> > >
> 0000000000000000000000000000000000000000..334d5bc47a16e53e9168b
> > > b1f90dfeff584b4e494
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > > @@ -0,0 +1,37 @@
> > > > +! { dg-do compile { target aarch64-*-* } }
> > > > +! { dg-additional-options "-w -Ofast" }
> > > > +
> > > > +  module brute_force
> > > > +    integer, parameter :: r=9
> > > > +     integer  block(r, r, 0)
> > > > +    contains
> > > > +  subroutine brute
> > > > +     do
> > > > +      do
> > > > +          do
> > > > +           do
> > > > +                do
> > > > +                     do
> > > > +                         do i7 = l0, 1
> > > > +                       select case(1 )
> > > > +                       case(1)
> > > > +                           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> > > > +                       end select
> > > > +                            do i8 = 1, 1
> > > > +                               do i9 = 1, 1
> > > > +                            if(1 == 1) then
> > > > +                                    call digits_20
> > > > +                                end if
> > > > +                                end do
> > > > +                          end do
> > > > +                    end do
> > > > +                    end do
> > > > +              end do
> > > > +              end do
> > > > +           end do
> > > > +     end do
> > > > +  end do
> > > > + end
> > > > +  end
> > > > +
> > > > +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, 
> > > > x[0-9]+\]} } }
> > > > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> > > > index
> > >
> 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e3
> > > 5f3d3a0d52470811 100644
> > > > --- a/gcc/tree-ssa-loop-ivopts.cc
> > > > +++ b/gcc/tree-ssa-loop-ivopts.cc
> > > > @@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, 
> > > > tree
> > > step,
> > > >        base = fold_convert (TREE_TYPE (base), aff_combination_to_tree
> (&comb));
> > > >      }
> > > >
> > > > +  /* If we know the IV won't overflow wrt niters and the type is an 
> > > > unsigned
> > > > +     type then fold using affine unsigned arithmetic to allow more 
> > > > folding of
> > > > +     constants.  */
> > > > +  if (no_overflow
> > > > +      && TYPE_UNSIGNED (TREE_TYPE (expr)))
> > > > +    {
> > > > +      aff_tree comb;
> > > > +      tree utype = unsigned_type_for (TREE_TYPE (expr));
> > > > +      tree_to_aff_combination (expr, utype, &comb);
> > > > +      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree
> (&comb));
> > > > +    }
> > > > +
> > >
> > > So right above we already do
> > >
> > >   /* Lower address expression in base except ones with DECL_P as operand.
> > >      By doing this:
> > >        1) More accurate cost can be computed for address expressions;
> > >        2) Duplicate candidates won't be created for bases in different
> > >           forms, like &a[0] and &a.  */
> > >   STRIP_NOPS (expr);
> > >   if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
> > >       || contain_complex_addr_expr (expr))
> > >     {
> > >       aff_tree comb;
> > >       tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
> > >       base = fold_convert (TREE_TYPE (base), aff_combination_to_tree
> > > (&comb));
> > >     }
> > >
> > > and if I read correctly 'expr' is
> > >
> > >   (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> > > * 81) + 9) * 4
> > >
> > > in your interesting case which means it doesn't satisfy
> > > contain_complex_addr_expr.
> > >
> > > I don't quite get why rewriting the base into (T)(unsigned)... is
> > > only valid when no_overflow - no_overflow is about {base, +, step},
> > > not about any overflow contained in 'base'.
> > >
> >
> > Right, so you're saying overflow only applies to step and it's final
> > evolution wrt to niters? Ok, makes sense.  I had thought it referred
> > to that the final possible address doesn't overflow. So offset + step * 
> > niters.
> 
> It refers to the IV itself not wrapping but 'base' is the initial
> value computation, whether that computation itself wraps or not
> (it would wrap the same for each iteration) isn't covered.
> 
> > > I wonder if we maybe want to record an "original" iv->base
> > > to be used for code generation (and there only the unexpanded form)
> > > and a variant used for the various sorts of canonicalization/compare
> > > (I see we eventually add/subtract step and then compare against
> > > sth else).  And then apply this normalization always to the not
> > > "original" form.
> >
> > That's certainly possible, but it'll make the code a bit harder to follow 
> > in the
> > cases where the comparisons are done between the split address base + step
> > and then a new IV built up from it.
> >
> > In those cases we'd have to split both expressions but used different ones 
> > in
> > the right places.  Where it gets messy is that something it updates the 
> > value
> > and then compares.
> >
> > So if we can avoid it it's likely cleaner.
> 
> Sure.
> 
> > >
> > > The above STRIP_NOPS (expr) + expand might turn an unsigned
> > > affine combination into a signed one which might be problematic.
> > > So what happens if you change the above to simply always
> > > unsigned expand?
> > >
> >
> > Yes, that's what's currently happening. It's stripping (signed) (a + b) into
> > a + b.
> >
> > To be clear you mean change the affine fold I added to always be unsigned?
> > and thus removing the contain_complex_addr_expr chunk?
> 
> Yes.
> 
> > That works and was my first patch locally (though need to fully reg-test 
> > it).
> 
> I'm curious what happens.
> 

Ok, we cannot fold unconditionally to unsigned; however, I believe the following is correct:

The patch folds every IV to unsigned to canonicalize them.  At the end of the
pass match.pd will then remove the unneeded conversions.

Note that we cannot force everything to unsigned: IVopts requires that array
address expressions remain as such.  Folding them turns them into pointer
expressions for which some optimizations in IVopts do not run.

If we fold array address expressions IVopts starts treating them as normal
integers, which leads to incorrect code (as shown by testsuite/gcc.dg/torture/bitint-49.c).

And if we fold POINTER_PLUS_EXPR the result is inefficient codegen, as shown
by testsuite/gcc.target/i386/pr115462.c.  So for POINTER_PLUS_EXPR we can
fold the offset but not the base.

Bootstrapped Regtested on aarch64-none-linux-gnu,
x86_64-pc-linux-gnu -m32, -m64 and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

        PR tree-optimization/114932
        * tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold.

gcc/testsuite/ChangeLog:

        PR tree-optimization/114932
        * gcc.dg/tree-ssa/pr64705.c: Update dump file scan.
        * gfortran.dg/addressing-modes_1.f90: New test.

-- inline copy of patch --

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
index fd24e38a53e9f3a4659dece5af4d71fbc0ce2c18..3c9c2e5deed1ba755d1fae15a6553d68b6ad0098 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
@@ -23,4 +23,4 @@ int foo(char *flags, long len, long i, long steps)
 
 /* Don't expand iv {base+step, step}_loop into {base+x+y, step}_loop
    even if "step == x + y".  */
-/* { dg-final { scan-tree-dump "Base:\\tstep_\[0-9\]* \\+ iter|Base:\\titer_\[0-9\]* \\+ step" "ivopts"} } */
+/* { dg-final { scan-tree-dump {Base:\t\(unsigned ?.*\) step_[0-9]* \+ \(unsigned ?.*\) iter|Base:\t\(unsigned ?.*\) iter_[0-9]* \+ \(unsigned ?.*\) step} "ivopts"} } */
diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
new file mode 100644
index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
@@ -0,0 +1,37 @@
+! { dg-do compile { target aarch64-*-* } }
+! { dg-additional-options "-w -Ofast" }
+
+  module brute_force
+    integer, parameter :: r=9
+     integer  block(r, r, 0)
+    contains
+  subroutine brute
+     do
+      do
+          do
+           do
+                do
+                     do
+                         do i7 = l0, 1
+                       select case(1 )
+                       case(1)
+                           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
+                       end select
+                            do i8 = 1, 1
+                               do i9 = 1, 1
+                            if(1 == 1) then
+                                    call digits_20
+                                end if
+                                end do
+                          end do
+                    end do
+                    end do
+              end do
+              end do
+           end do
+     end do
+  end do
+ end
+  end
+
+! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
index b05e4ac1e63b5c110707a8a3ed5e614b7ccc702f..5fc188ae3f8207943a9503e2dc51b067c3f85739 100644
--- a/gcc/tree-ssa-loop-ivopts.cc
+++ b/gcc/tree-ssa-loop-ivopts.cc
@@ -1184,23 +1184,41 @@ static struct iv *
 alloc_iv (struct ivopts_data *data, tree base, tree step,
          bool no_overflow = false)
 {
-  tree expr = base;
   struct iv *iv = (struct iv*) obstack_alloc (&data->iv_obstack,
                                              sizeof (struct iv));
   gcc_assert (step != NULL_TREE);
 
-  /* Lower address expression in base except ones with DECL_P as operand.
-     By doing this:
+  /* Lower address expression in base by folding if possible while maintaining
+     the original type of the expression.  Any ADDR_EXPR should not be folded as
+     we will end up folding this into an integer and losing the fact it's an
+     address.
+     By doing this folding we gain:
        1) More accurate cost can be computed for address expressions;
        2) Duplicate candidates won't be created for bases in different
          forms, like &a[0] and &a.  */
+  tree expr = base;
   STRIP_NOPS (expr);
-  if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
-      || contain_complex_addr_expr (expr))
+  if (TREE_CODE (expr) != ADDR_EXPR)
     {
       aff_tree comb;
-      tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
-      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
+      tree addr = NULL_TREE;
+      /* Contrary to the comment above, IVopts does not deal well with folding
+        the addresses of pointers.  We avoid folding ADDR_EXPR, but when we have
+        a POINTER_PLUS_EXPR we want to keep the base the same but still fold the
+        offset.  */
+      if (TREE_CODE (expr) == POINTER_PLUS_EXPR)
+       {
+         addr = TREE_OPERAND (expr, 0);
+         expr = TREE_OPERAND (expr, 1);
+       }
+
+      tree utype = unsigned_type_for (TREE_TYPE (expr));
+      tree_to_aff_combination (expr, utype, &comb);
+      expr = aff_combination_to_tree (&comb);
+      if (addr)
+       expr = fold_build_pointer_plus (addr, expr);
+
+      base = fold_convert (TREE_TYPE (base), expr);
     }
 
   iv->base = base;

Attachment: rb18487.patch
Description: rb18487.patch
