Ping
> -----Original Message-----
> From: Tamar Christina <tamar.christ...@arm.com>
> Sent: Thursday, November 7, 2024 11:50 AM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <n...@arm.com>; rguent...@suse.de
> Subject: [PATCH][ivopts]: perform affine fold to unsigned on non address
> expressions. [PR114932]
>
> Hi All,
>
> When the patch for PR114074 was applied we saw a good boost in exchange2.
>
> This boost was partially caused by a simplification of the addressing modes.
> With the patch applied IVopts saw the following form for the base addressing:
>
>   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)
>
> vs what we normally get:
>
>   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
>
> This is because the patch promoted multiplies where one operand is a constant
> from a signed multiply to an unsigned one, to attempt to fold away the
> constant.
>
> This patch attempts the same, but due to the various problems with SCEV and
> niters not being able to analyze the resulting forms (i.e. PR114322) we can't
> do it during SCEV, or in general form in fold-const the way extract_muldiv
> attempts.
>
> Instead this applies the simplification during IVopts initialization when we
> create the IV.  Essentially, when we know the IV won't overflow with regard to
> niters, we perform an affine fold, which simplifies the internal computation
> even if it is signed, because we know that for IVopts' uses the IV won't ever
> overflow.  This allows IVopts to see the simplified form without influencing
> the rest of the compiler.
>
> As mentioned in PR114074 it would be good to fix the missed optimization in
> the other passes so we can perform this in general.
>
> The reason this has a big impact on Fortran code is that Fortran doesn't seem
> to have unsigned integer types.  As such all its addressing is created with
> signed types, and folding does not happen on them due to the possible
> overflow.
>
> Concretely, on AArch64 this changes the generated code from:
>
>         mov     x27, -108
>         mov     x24, -72
>         mov     x23, -36
>         add     x21, x1, x0, lsl 2
>         add     x19, x20, x22
> .L5:
>         add     x0, x22, x19
>         add     x19, x19, 324
>         ldr     d1, [x0, x27]
>         add     v1.2s, v1.2s, v15.2s
>         str     d1, [x20, 216]
>         ldr     d0, [x0, x24]
>         add     v0.2s, v0.2s, v15.2s
>         str     d0, [x20, 252]
>         ldr     d31, [x0, x23]
>         add     v31.2s, v31.2s, v15.2s
>         str     d31, [x20, 288]
>         bl      digits_20_
>         cmp     x21, x19
>         bne     .L5
>
> into:
>
> .L5:
>         ldr     d1, [x19, -108]
>         add     v1.2s, v1.2s, v15.2s
>         str     d1, [x20, 216]
>         ldr     d0, [x19, -72]
>         add     v0.2s, v0.2s, v15.2s
>         str     d0, [x20, 252]
>         ldr     d31, [x19, -36]
>         add     x19, x19, 324
>         add     v31.2s, v31.2s, v15.2s
>         str     d31, [x20, 288]
>         bl      digits_20_
>         cmp     x21, x19
>         bne     .L5
>
> The two patches together result in a 10% performance increase in exchange2 in
> SPECCPU 2017, a 4% reduction in binary size, and a 5% improvement in compile
> time.  There's also a 5% performance improvement in fotonik3d and a similar
> reduction in binary size.
>
> The patch folds every IV to unsigned to canonicalize them.  At the end of the
> pass match.pd will then remove unneeded conversions.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu
> (-m32 and -m64), with the issues below:
>
>   * gcc.dg/torture/bitint-49.c   -O1  execution test
>   * gcc.c-torture/execute/pr110115.c   -O1  execution test
>
> These two start to fail now because of a bug in the stack slot sharing
> conflict function.  Basically the change turns the addressing from ADDR_REF
> into (unsigned) ADDR_REF, and add_scope_conflicts_2 does not look deep enough
> through the promotion to realize that the two values are live at the same
> time.
>
> Both of these issues are fixed by Andrew's patch [1].  Since that patch
> rewrites the entire thing, it didn't seem useful to provide a spot fix for
> this here.
>
> [1] https://inbox.sourceware.org/gcc-patches/20241017024205.2660484-1-quic_apin...@quicinc.com/
>
> Ok for master after Andrew's patch gets in?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	PR tree-optimization/114932
> 	* tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold.
>
> gcc/testsuite/ChangeLog:
>
> 	PR tree-optimization/114932
> 	* gcc.dg/tree-ssa/pr64705.c: Update dump file scan.
> 	* gcc.target/i386/pr115462.c: The testcase has 3 IVs which calculate
> 	the same thing but with slightly different increment offsets.  The
> 	test checks for 3 complex addressing loads, one for each IV, but with
> 	this change they now all share one IV; that is, the loop now has only
> 	one complex addressing.  This is ultimately driven by the backend
> 	costing, and the current costing says this is preferred, so update
> 	the testcase.
> 	* gfortran.dg/addressing-modes_1.f90: New test.
>
> ---
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
> index fd24e38a53e9f3a4659dece5af4d71fbc0ce2c18..3c9c2e5deed1ba755d1fae15a6553d68b6ad0098 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
> @@ -23,4 +23,4 @@ int foo(char *flags, long len, long i, long steps)
>  
>  /* Don't expand iv {base+step, step}_loop into {base+x+y, step}_loop
>     even if "step == x + y".  */
> -/* { dg-final { scan-tree-dump "Base:\\tstep_\[0-9\]* \\+ iter|Base:\\titer_\[0-9\]* \\+ step" "ivopts"} } */
> +/* { dg-final { scan-tree-dump {Base:\t\(unsigned ?.*\) step_[0-9]* \+ \(unsigned ?.*\) iter|Base:\t\(unsigned ?.*\) iter_[0-9]* \+ \(unsigned ?.*\) step} "ivopts"} } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr115462.c b/gcc/testsuite/gcc.target/i386/pr115462.c
> index ad50a6382bc4bed3e24a281663f50d24b989560d..6526f99bf0e3849c0f633932f24bd7e93d737f46 100644
> --- a/gcc/testsuite/gcc.target/i386/pr115462.c
> +++ b/gcc/testsuite/gcc.target/i386/pr115462.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -mavx2 -fno-tree-vectorize -fno-pic" } */
> -/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, p1\.0\+[0-9]*\(,} 3 } } */
> +/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, p1\.0\+[0-9]*\(,} 1 } } */
>  
>  int
>  foo (long indx, long indx2, long indx3, long indx4, long indx5, long indx6, long n, int* q)
> diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> new file mode 100644
> index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
> --- /dev/null
> +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> @@ -0,0 +1,37 @@
> +! { dg-do compile { target aarch64-*-* } }
> +! { dg-additional-options "-w -Ofast" }
> +
> +  module brute_force
> +    integer, parameter :: r=9
> +    integer block(r, r, 0)
> +  contains
> +  subroutine brute
> +   do
> +    do
> +     do
> +      do
> +       do
> +        do
> +         do i7 = l0, 1
> +           select case(1 )
> +           case(1)
> +             block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> +           end select
> +           do i8 = 1, 1
> +             do i9 = 1, 1
> +               if(1 == 1) then
> +                 call digits_20
> +               end if
> +             end do
> +           end do
> +         end do
> +        end do
> +       end do
> +      end do
> +     end do
> +    end do
> +   end do
> +  end
> +  end
> +
> +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> index bf4a865bdfe7752a161fecea8b10d0ffb54c9833..4772af4a52a7f0a0662ce7bf0b2a655b02b6418d 100644
> --- a/gcc/tree-ssa-loop-ivopts.cc
> +++ b/gcc/tree-ssa-loop-ivopts.cc
> @@ -1190,19 +1190,11 @@ alloc_iv (struct ivopts_data *data, tree base, tree step,
>  	   sizeof (struct iv));
>    gcc_assert (step != NULL_TREE);
>  
> -  /* Lower address expression in base except ones with DECL_P as operand.
> -     By doing this:
> -       1) More accurate cost can be computed for address expressions;
> -       2) Duplicate candidates won't be created for bases in different
> -	  forms, like &a[0] and &a.  */
> +  aff_tree comb;
>    STRIP_NOPS (expr);
> -  if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
> -      || contain_complex_addr_expr (expr))
> -    {
> -      aff_tree comb;
> -      tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
> -      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> -    }
> +  expr = fold_convert (unsigned_type_for (TREE_TYPE (expr)), expr);
> +  tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
> +  base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
>  
>    iv->base = base;
>    iv->base_object = determine_base_object (data, base);
>
> --