-----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Thursday, July 11, 2024 12:39 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; bin.ch...@linux.alibaba.com
> Subject: RE: [PATCH][ivopts]: perform affine fold on unsigned addressing
> modes known not to overflow. [PR114932]
>
> On Wed, 10 Jul 2024, Tamar Christina wrote:
>
> > > -----Original Message-----
> > > From: Richard Biener <rguent...@suse.de>
> > > Sent: Thursday, June 20, 2024 8:55 AM
> > > To: Tamar Christina <tamar.christ...@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; bin.ch...@linux.alibaba.com
> > > Subject: RE: [PATCH][ivopts]: perform affine fold on unsigned addressing
> > > modes known not to overflow. [PR114932]
> > >
> > > On Wed, 19 Jun 2024, Tamar Christina wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <rguent...@suse.de>
> > > > > Sent: Wednesday, June 19, 2024 1:14 PM
> > > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; bin.ch...@linux.alibaba.com
> > > > > Subject: Re: [PATCH][ivopts]: perform affine fold on unsigned addressing
> > > > > modes known not to overflow. [PR114932]
> > > > >
> > > > > On Fri, 14 Jun 2024, Tamar Christina wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > When the patch for PR114074 was applied we saw a good boost in exchange2.
> > > > > >
> > > > > > This boost was partially caused by a simplification of the addressing
> > > > > > modes.  With the patch applied IVopts saw the following form for the
> > > > > > base addressing:
> > > > > >
> > > > > >   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D)
> > > > > >   * 324) + 36)
> > > > > >
> > > > > > vs what we normally get:
> > > > > >
> > > > > >   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> > > > > >   * 81) + 9) * 4
> > > > > >
> > > > > > This is because the patch promoted multiplies where one operand is a
> > > > > > constant from a signed multiply to an unsigned one, to attempt to fold
> > > > > > away the constant.
> > > > > >
> > > > > > This patch attempts the same, but due to the various problems with SCEV
> > > > > > and niters not being able to analyze the resulting forms (i.e. PR114322)
> > > > > > we can't do it during SCEV or in a general form in fold-const the way
> > > > > > extract_muldiv attempts.
> > > > > >
> > > > > > Instead this applies the simplification during IVopts initialization,
> > > > > > when we create the IV.  Essentially, when we know the IV won't overflow
> > > > > > with regards to niters, we perform an affine fold, which simplifies the
> > > > > > internal computation even if it is signed, because we know that for
> > > > > > IVopts' uses the IV won't ever overflow.  This allows IVopts to see the
> > > > > > simplified form without influencing the rest of the compiler.
> > > > > >
> > > > > > As mentioned in PR114074 it would be good to fix the missed optimization
> > > > > > in the other passes so we can perform this in general.
> > > > > >
> > > > > > The reason this has a big impact on Fortran code is that Fortran doesn't
> > > > > > seem to have unsigned integer types.  As such all its addressing is
> > > > > > created with signed types and folding does not happen on them due to the
> > > > > > possible overflow.
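> > > > > >
> > > > > > To illustrate the fold itself (a reduced C sketch with made-up
> > > > > > variables, not taken from the benchmark): signed overflow is
> > > > > > undefined, so the outer scaling cannot be distributed into a signed
> > > > > > multiply without an overflow guarantee, while the unsigned form
> > > > > > combines freely:
> > > > > >
> > > > > >   char *base;  /* assumed pointer, stands in for &block  */
> > > > > >   long i;      /* assumed signed index, stands in for l0_19  */
> > > > > >
> > > > > >   /* Signed index: ((long) i * 81 + 9) * 4 cannot be rewritten to
> > > > > >      (long) i * 324 + 36 unless we can prove no signed overflow.  */
> > > > > >   char *p = base + ((long) i * 81 + 9) * 4;
> > > > > >
> > > > > >   /* Unsigned index: wrapping is well defined, so fold-const can
> > > > > >      combine the constants into a single multiply and offset.  */
> > > > > >   char *q = base + ((unsigned long) i * 324 + 36);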
> > > > > >
> > > > > > Concretely, on AArch64 this changes the generated code from:
> > > > > >
> > > > > >         mov     x27, -108
> > > > > >         mov     x24, -72
> > > > > >         mov     x23, -36
> > > > > >         add     x21, x1, x0, lsl 2
> > > > > >         add     x19, x20, x22
> > > > > > .L5:
> > > > > >         add     x0, x22, x19
> > > > > >         add     x19, x19, 324
> > > > > >         ldr     d1, [x0, x27]
> > > > > >         add     v1.2s, v1.2s, v15.2s
> > > > > >         str     d1, [x20, 216]
> > > > > >         ldr     d0, [x0, x24]
> > > > > >         add     v0.2s, v0.2s, v15.2s
> > > > > >         str     d0, [x20, 252]
> > > > > >         ldr     d31, [x0, x23]
> > > > > >         add     v31.2s, v31.2s, v15.2s
> > > > > >         str     d31, [x20, 288]
> > > > > >         bl      digits_20_
> > > > > >         cmp     x21, x19
> > > > > >         bne     .L5
> > > > > >
> > > > > > into:
> > > > > >
> > > > > > .L5:
> > > > > >         ldr     d1, [x19, -108]
> > > > > >         add     v1.2s, v1.2s, v15.2s
> > > > > >         str     d1, [x20, 216]
> > > > > >         ldr     d0, [x19, -72]
> > > > > >         add     v0.2s, v0.2s, v15.2s
> > > > > >         str     d0, [x20, 252]
> > > > > >         ldr     d31, [x19, -36]
> > > > > >         add     x19, x19, 324
> > > > > >         add     v31.2s, v31.2s, v15.2s
> > > > > >         str     d31, [x20, 288]
> > > > > >         bl      digits_20_
> > > > > >         cmp     x21, x19
> > > > > >         bne     .L5
> > > > > >
> > > > > > The two patches together result in a 10% performance increase in
> > > > > > exchange2 in SPECCPU 2017, a 4% reduction in binary size, and a 5%
> > > > > > improvement in compile time.  There's also a 5% performance
> > > > > > improvement in fotonik3d and a similar reduction in binary size.
> > > > > >
> > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > > > >
> > > > > > Ok for master?
> > > > > >
> > > > > > Thanks,
> > > > > > Tamar
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > > 	PR tree-optimization/114932
> > > > > > 	* tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > > 	PR tree-optimization/114932
> > > > > > 	* gfortran.dg/addressing-modes_1.f90: New test.
> > > > > >
> > > > > > ---
> > > > > > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > > > > new file mode 100644
> > > > > > index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > > > > @@ -0,0 +1,37 @@
> > > > > > +! { dg-do compile { target aarch64-*-* } }
> > > > > > +! { dg-additional-options "-w -Ofast" }
> > > > > > +
> > > > > > +  module brute_force
> > > > > > +    integer, parameter :: r=9
> > > > > > +    integer block(r, r, 0)
> > > > > > +  contains
> > > > > > +  subroutine brute
> > > > > > +   do
> > > > > > +    do
> > > > > > +     do
> > > > > > +      do
> > > > > > +       do
> > > > > > +        do
> > > > > > +         do i7 = l0, 1
> > > > > > +          select case(1 )
> > > > > > +          case(1)
> > > > > > +           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> > > > > > +          end select
> > > > > > +          do i8 = 1, 1
> > > > > > +           do i9 = 1, 1
> > > > > > +            if(1 == 1) then
> > > > > > +             call digits_20
> > > > > > +            end if
> > > > > > +           end do
> > > > > > +          end do
> > > > > > +         end do
> > > > > > +        end do
> > > > > > +       end do
> > > > > > +      end do
> > > > > > +     end do
> > > > > > +    end do
> > > > > > +   end do
> > > > > > +  end
> > > > > > +  end
> > > > > > +
> > > > > > +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> > > > > > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> > > > > > index 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e35f3d3a0d52470811 100644
> > > > > > --- a/gcc/tree-ssa-loop-ivopts.cc
> > > > > > +++ b/gcc/tree-ssa-loop-ivopts.cc
> > > > > > @@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, tree step,
> > > > > >        base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > > > > >      }
> > > > > >
> > > > > > +  /* If we know the IV won't overflow wrt niters and the type is an unsigned
> > > > > > +     type then fold using affine unsigned arithmetic to allow more folding of
> > > > > > +     constants.  */
> > > > > > +  if (no_overflow
> > > > > > +      && TYPE_UNSIGNED (TREE_TYPE (expr)))
> > > > > > +    {
> > > > > > +      aff_tree comb;
> > > > > > +      tree utype = unsigned_type_for (TREE_TYPE (expr));
> > > > > > +      tree_to_aff_combination (expr, utype, &comb);
> > > > > > +      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > > > > > +    }
> > > > > > +
> > > > >
> > > > > So right above we already do
> > > > >
> > > > >   /* Lower address expression in base except ones with DECL_P as operand.
> > > > >      By doing this:
> > > > >        1) More accurate cost can be computed for address expressions;
> > > > >        2) Duplicate candidates won't be created for bases in different
> > > > > 	  forms, like &a[0] and &a.  */
> > > > >   STRIP_NOPS (expr);
> > > > >   if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
> > > > >       || contain_complex_addr_expr (expr))
> > > > >     {
> > > > >       aff_tree comb;
> > > > >       tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
> > > > >       base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > > > >     }
> > > > >
> > > > > and if I read correctly 'expr' is
> > > > >
> > > > >   (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
> > > > >
> > > > > in your interesting case, which means it doesn't satisfy
> > > > > contain_complex_addr_expr.
> > > > >
> > > > > I don't quite get why rewriting the base into (T)(unsigned)... is
> > > > > only valid when no_overflow - no_overflow is about {base, +, step},
> > > > > not about any overflow contained in 'base'.
> > > >
> > > > Right, so you're saying overflow only applies to step and its final
> > > > evolution wrt niters?  Ok, makes sense.  I had thought it meant that the
> > > > final possible address doesn't overflow, i.e. offset + step * niters.
> > >
> > > It refers to the IV itself not wrapping, but 'base' is the initial
> > > value computation; whether that computation itself wraps or not
> > > (it would wrap the same for each iteration) isn't covered.
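> > >
> > > To make the distinction concrete (an illustrative sketch, not from the
> > > testcase): for
> > >
> > >   for (i = 0; i < n; i++)
> > >     use (p + (long) k * 81 + i);
> > >
> > > the IV is {p + (long) k * 81, +, 1}, and no_overflow only says the
> > > sequence base, base + 1, ..., base + (n - 1) does not wrap; whether the
> > > initial computation (long) k * 81 itself overflows is a separate
> > > question.
> > >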
> > > > > I wonder if we maybe want to record an "original" iv->base
> > > > > to be used for code generation (and there only the unexpanded form)
> > > > > and a variant used for the various sorts of canonicalization/compare
> > > > > (I see we eventually add/subtract step and then compare against
> > > > > sth else).  And then apply this normalization always to the not
> > > > > "original" form.
> > > >
> > > > That's certainly possible, but it'll make the code a bit harder to
> > > > follow in the cases where the comparisons are done between the split
> > > > address base + step and then a new IV built up from it.
> > > >
> > > > In those cases we'd have to split both expressions but use different
> > > > ones in the right places.  Where it gets messy is that sometimes it
> > > > updates the value and then compares.
> > > >
> > > > So if we can avoid it, it's likely cleaner.
> > >
> > > Sure.
> > >
> > > > > The above STRIP_NOPS (expr) + expand might turn an unsigned
> > > > > affine combination into a signed one which might be problematic.
> > > > > So what happens if you change the above to simply always
> > > > > unsigned expand?
> > > >
> > > > Yes, that's what's currently happening.  It's stripping (signed) (a + b)
> > > > into a + b.
> > > >
> > > > To be clear, you mean change the affine fold I added to always be
> > > > unsigned, and thus remove the contain_complex_addr_expr chunk?
> > >
> > > Yes.
> > >
> > > > That works and was my first patch locally (though I need to fully
> > > > reg-test it).
> > >
> > > I'm curious what happens.
> >
> > Ok.  We cannot fold unconditionally to unsigned; however, this I believe
> > is correct:
> >
> > The patch folds every IV to unsigned to canonicalize them.  At the end of
> > the pass match.pd will then remove unneeded conversions.
> >
> > Note that we cannot force everything to unsigned: IVopts requires that
> > array address expressions remain as such.  Folding them results in them
> > becoming pointer expressions for which some optimizations in IVopts do
> > not run.
>
> Isn't that what contain_complex_addr_expr tries to assert?  That there
> isn't a DECL in the expression?

Yes, but in this case, specifically for POINTER_PLUS_EXPR, it's perfectly
fine for the base to be a DECL as we won't be folding it.  Though looking at
this, perhaps the same condition should be applied to PLUS_EXPR and
MINUS_EXPR.  So I think contain_complex_addr_expr should maybe not check
operand0.
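
For example (a hypothetical reduced base expression):

  POINTER_PLUS_EXPR <&a, (sizetype) i_1 * 4 + 16>

Here operand 0 is the DECL address &a, but we only expand and fold operand 1
(the offset), so rejecting the whole expression because of the DECL in
operand 0 shouldn't be needed for POINTER_PLUS_EXPR.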

> > If we fold array expressions it starts treating them as normal integers
> > and leads to incorrect code (as shown by
> > testsuite/gcc.dg/torture/bitint-49.c).
> >
> > And if we fold POINTER_PLUS_EXPR then this leads to inefficient codegen,
> > as is shown by testsuite/gcc.target/i386/pr115462.c.  So for
> > POINTER_PLUS_EXPR we can fold the offset but not the base.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > x86_64-pc-linux-gnu -m32, -m64 and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	PR tree-optimization/114932
> > 	* tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	PR tree-optimization/114932
> > 	* gcc.dg/tree-ssa/pr64705.c: Update dump file scan.
> > 	* gfortran.dg/addressing-modes_1.f90: New test.
> >
> > -- inline copy of patch --
> >
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
> > index fd24e38a53e9f3a4659dece5af4d71fbc0ce2c18..3c9c2e5deed1ba755d1fae15a6553d68b6ad0098 100644
> > --- a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
> > @@ -23,4 +23,4 @@ int foo(char *flags, long len, long i, long steps)
> >
> >  /* Don't expand iv {base+step, step}_loop into {base+x+y, step}_loop
> >     even if "step == x + y".  */
> > -/* { dg-final { scan-tree-dump "Base:\\tstep_\[0-9\]* \\+ iter|Base:\\titer_\[0-9\]* \\+ step" "ivopts"} } */
> > +/* { dg-final { scan-tree-dump {Base:\t\(unsigned ?.*\) step_[0-9]* \+ \(unsigned ?.*\) iter|Base:\t\(unsigned ?.*\) iter_[0-9]* \+ \(unsigned ?.*\) step} "ivopts"} } */
> > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
> > --- /dev/null
> > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > @@ -0,0 +1,37 @@
> > +! { dg-do compile { target aarch64-*-* } }
> > +! { dg-additional-options "-w -Ofast" }
> > +
> > +  module brute_force
> > +    integer, parameter :: r=9
> > +    integer block(r, r, 0)
> > +  contains
> > +  subroutine brute
> > +   do
> > +    do
> > +     do
> > +      do
> > +       do
> > +        do
> > +         do i7 = l0, 1
> > +          select case(1 )
> > +          case(1)
> > +           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> > +          end select
> > +          do i8 = 1, 1
> > +           do i9 = 1, 1
> > +            if(1 == 1) then
> > +             call digits_20
> > +            end if
> > +           end do
> > +          end do
> > +         end do
> > +        end do
> > +       end do
> > +      end do
> > +     end do
> > +    end do
> > +   end do
> > +  end
> > +  end
> > +
> > +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> > index b05e4ac1e63b5c110707a8a3ed5e614b7ccc702f..5fc188ae3f8207943a9503e2dc51b067c3f85739 100644
> > --- a/gcc/tree-ssa-loop-ivopts.cc
> > +++ b/gcc/tree-ssa-loop-ivopts.cc
> > @@ -1184,23 +1184,41 @@ static struct iv *
> >  alloc_iv (struct ivopts_data *data, tree base, tree step,
> >  	  bool no_overflow = false)
> >  {
> > -  tree expr = base;
> >    struct iv *iv = (struct iv*) obstack_alloc (&data->iv_obstack,
> >  					      sizeof (struct iv));
> >    gcc_assert (step != NULL_TREE);
> >
> > -  /* Lower address expression in base except ones with DECL_P as operand.
> > -     By doing this:
> > +  /* Lower address expression in base by folding if possible while maintaining
> > +     the original type of the expression.  Any ADDR_EXPR should not be folded as
> > +     we will end up folding this into an integer and losing the fact it's an
> > +     address.
> > +     By doing this folding we gain:
> >       1) More accurate cost can be computed for address expressions;
> >       2) Duplicate candidates won't be created for bases in different
> >  	forms, like &a[0] and &a.  */
> > +  tree expr = base;
> >    STRIP_NOPS (expr);
> > -  if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
> > -      || contain_complex_addr_expr (expr))
> > +  if (TREE_CODE (expr) != ADDR_EXPR)
>
> So this runs the conversion for some additional cases but not for some
> cases it ran before.

Yeah, those cases should probably be folded as they were before.  So I think
this should be kept as an else.

> IIRC the original purpose is to "expand" &a[i] to &a + CST * i and
> the guard is to avoid churn on trees otherwise.
>
> Your goal is to convert '(unsigned)i' to 'i'.

Don't you mean the opposite?  Fold 'i' into '(unsigned)i'?

> But definitely TREE_CODE (expr) != ADDR_EXPR is odd since I don't
> see the difference between &x and &x + 1.

The issue isn't this, but that folding &a[i] into (unsigned)&a + i leads to
incorrect code.  The original fold never changed the sign.

> >    {
> >      aff_tree comb;
> > -    tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
> > -    base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > +    tree addr = NULL_TREE;
> > +    /* Contrary to the comment above, IVopts does not deal well with folding
> > +       address the pointers.
>
> "folding address the pointers"?
>
> > +       We avoid folding ADDR_EXPR but when we have a
> > +       POINTER_PLUS_EXPR we want to keep the base the same but still fold the
> > +       offset.  */
>
> But oddly we did this folding before?  Btw, your patch removes the
> only use of contain_complex_addr_expr but not the function itself.

No, the original fold preserves the sign of the expression, which is fine;
this now always folds to unsigned.  So I think we want that if we can't fold
to unsigned, we should still fold using the original sign to canonicalize
these address expressions?  That should cover both cases.

> So you're talking about the case where 'base' has a pointer-to-integer
> or integer-to-pointer conversion which STRIP_NOPS strips?
>
> > +    if (TREE_CODE (expr) == POINTER_PLUS_EXPR)
> > +      {
> > +	addr = TREE_OPERAND (expr, 0);
> > +	expr = TREE_OPERAND (expr, 1);
>
> but expr should be unsigned already, see expr_to_aff_combination

So I definitely see cases where expr is unsigned but has not been folded as
unsigned, i.e. the simplification of the expression as unsigned was never
performed, only a cast added.  I don't see any other conversion to unsigned
happening in the affine transform; I only see calls to fold_convert.

> > +      }
> > +
> > +    tree utype = unsigned_type_for (TREE_TYPE (expr));
> > +    tree_to_aff_combination (expr, utype, &comb);
>
> I don't think tree_to_aff_combination is supposed to "convert" on the
> fly.  There is aff_combination_convert for this.

Hmm, I thought the actual conversion happens in aff_combination_to_tree, with
tree_to_aff_combination just setting the desired type.  It looks like
aff_combination_convert does the same, except it round-trips it back through
an aff_tree, but I don't need the round-tripping.  So I can use it, but it
seems more expensive...
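
For reference, the aff_combination_convert variant would look something like
this (an untested sketch):

  aff_tree comb;
  /* Build the affine combination in the original (possibly signed) type.  */
  tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
  /* Then convert the whole combination to the unsigned type before
     rebuilding the tree.  */
  aff_combination_convert (&comb, unsigned_type_for (TREE_TYPE (expr)));
  expr = aff_combination_to_tree (&comb);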

Thanks for the review,
Tamar

> > +    expr = aff_combination_to_tree (&comb);
> > +    if (addr)
> > +      expr = fold_build_pointer_plus (addr, expr);
> > +
> > +    base = fold_convert (TREE_TYPE (base), expr);
> >    }
> >
> >    iv->base = base;
> >
>
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)