[PATCH][ivopts]: perform affine fold to unsigned on non address expressions. [PR114932]

Tamar Christina Thu, 07 Nov 2024 11:13:47 -0800

Hi All,

When the patch for PR114074 was applied we saw a good boost in exchange2.


This boost was partially caused by a simplification of the addressing modes.
With the patch applied IV opts saw the following form for the base addressing;

  Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
324) + 36)

vs what we normally get:

  Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
* 81) + 9) * 4

This is because the patch promoted multiplies where one operand is a constant
from a signed multiply to an unsigned one, to attempt to fold away the constant.

This patch attempts the same but due to the various problems with SCEV and
niters not being able to analyze the resulting forms (i.e. PR114322) we can't
do it during SCEV or in the general form like in fold-const like extract_muldiv
attempts.

Instead this applies the simplification during IVopts initialization when we
create the IV.  Essentially when we know the IV won't overflow with regards to
niters then we perform an affine fold which gets it to simplify the internal
computation, even if this is signed because we know that for IVOPTs uses the
IV won't ever overflow.  This allows IV opts to see the simplified form
without influencing the rest of the compiler.

as mentioned in PR114074 it would be good to fix the missed optimization in the
other passes so we can perform this in general.

The reason this has a big impact on fortran code is that fortran doesn't seem to
have unsigned integer types.  As such all it's addressing are created with
signed types and folding does not happen on them due to the possible overflow.

concretely on AArch64 this changes the results from generation:

        mov     x27, -108
        mov     x24, -72
        mov     x23, -36
        add     x21, x1, x0, lsl 2
        add     x19, x20, x22
.L5:
        add     x0, x22, x19
        add     x19, x19, 324
        ldr     d1, [x0, x27]
        add     v1.2s, v1.2s, v15.2s
        str     d1, [x20, 216]
        ldr     d0, [x0, x24]
        add     v0.2s, v0.2s, v15.2s
        str     d0, [x20, 252]
        ldr     d31, [x0, x23]
        add     v31.2s, v31.2s, v15.2s
        str     d31, [x20, 288]
        bl      digits_20_
        cmp     x21, x19
        bne     .L5

into:

.L5:
        ldr     d1, [x19, -108]
        add     v1.2s, v1.2s, v15.2s
        str     d1, [x20, 216]
        ldr     d0, [x19, -72]
        add     v0.2s, v0.2s, v15.2s
        str     d0, [x20, 252]
        ldr     d31, [x19, -36]
        add     x19, x19, 324
        add     v31.2s, v31.2s, v15.2s
        str     d31, [x20, 288]
        bl      digits_20_
        cmp     x21, x19
        bne     .L5

The two patches together results in a 10% performance increase in exchange2 in
SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in compile
time. There's also a 5% performance improvement in fotonik3d and similar
reduction in binary size.

The patch folds every IV to unsigned to canonicalize them.  At the end of the
pass we match.pd will then remove unneeded conversions.

Bootstrapped Regtested on aarch64-none-linux-gnu,
x86_64-pc-linux-gnu -m32, -m64 and some issues below:

 * gcc.dg/torture/bitint-49.c   -O1  execution test
 * gcc.c-torture/execute/pr110115.c   -O1  execution test

These two start to fail now because of a bug in the stack slot sharing conflict
function.  Basically the change changes the addressing from ADDR_REF to
(unsigned) ADDR_REF and add_scope_conflics_2 does not look deep enough through
the promotion to realize that the two values are live at the same time.

Both of these issues are fixed by Andrew's patch [1],  Since this patch rewrites
the entire thing, it didn't seem useful for me to provide a spot fix for this.

[1] 
https://inbox.sourceware.org/gcc-patches/20241017024205.2660484-1-quic_apin...@quicinc.com/

Ok for master after Andrew's patch gets in?

Thanks,
Tamar

gcc/ChangeLog:

        PR tree-optimization/114932
        * tree-scalar-evolution.cc (alloc_iv): Perform affine unsigned fold.

gcc/testsuite/ChangeLog:

        PR tree-optimization/114932
        * gcc.dg/tree-ssa/pr64705.c: Update dump file scan.
        * gcc.target/i386/pr115462.c: The testcase shares 3 IVs which calculates
        the same thing but with a slightly different increment offset.  The test
        checks for 3 complex addressing loads, one for each IV.  But with this
        change they now all share one IV.  That is the loop now only has one
        complex addressing.  This is ultimately driven by the backend costing
        and the current costing says this is preferred so updating the testcase.
        * gfortran.dg/addressing-modes_1.f90: New test.

---
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c 
b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
index 
fd24e38a53e9f3a4659dece5af4d71fbc0ce2c18..3c9c2e5deed1ba755d1fae15a6553d68b6ad0098
 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
@@ -23,4 +23,4 @@ int foo(char *flags, long len, long i, long steps)
 
 /* Don't expand iv {base+step, step}_loop into {base+x+y, step}_loop
    even if "step == x + y".  */
-/* { dg-final { scan-tree-dump "Base:\\tstep_\[0-9\]* \\+ 
iter|Base:\\titer_\[0-9\]* \\+ step" "ivopts"} } */
+/* { dg-final { scan-tree-dump {Base:\t\(unsigned ?.*\) step_[0-9]* \+ 
\(unsigned ?.*\) iter|Base:\t\(unsigned ?.*\) iter_[0-9]* \+ \(unsigned ?.*\) 
step} "ivopts"} } */
diff --git a/gcc/testsuite/gcc.target/i386/pr115462.c 
b/gcc/testsuite/gcc.target/i386/pr115462.c
index 
ad50a6382bc4bed3e24a281663f50d24b989560d..6526f99bf0e3849c0f633932f24bd7e93d737f46
 100644
--- a/gcc/testsuite/gcc.target/i386/pr115462.c
+++ b/gcc/testsuite/gcc.target/i386/pr115462.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mavx2 -fno-tree-vectorize -fno-pic" } */
-/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, p1\.0\+[0-9]*\(,} 3 } 
} */
+/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, p1\.0\+[0-9]*\(,} 1 } 
} */
 
 int
 foo (long indx, long indx2, long indx3, long indx4, long indx5, long indx6, 
long n, int* q)
diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 
b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
new file mode 100644
index 
0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
@@ -0,0 +1,37 @@
+! { dg-do compile { target aarch64-*-* } }
+! { dg-additional-options "-w -Ofast" }
+
+  module brute_force
+    integer, parameter :: r=9
+     integer  block(r, r, 0)
+    contains
+  subroutine brute
+     do
+      do
+          do
+           do
+                do
+                     do
+                         do i7 = l0, 1
+                       select case(1 )
+                       case(1)
+                           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
+                       end select
+                            do i8 = 1, 1
+                               do i9 = 1, 1
+                            if(1 == 1) then
+                                    call digits_20
+                                end if
+                                end do
+                          end do
+                    end do
+                    end do
+              end do
+              end do
+           end do
+     end do
+  end do
+ end
+  end
+
+! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
index 
bf4a865bdfe7752a161fecea8b10d0ffb54c9833..4772af4a52a7f0a0662ce7bf0b2a655b02b6418d
 100644
--- a/gcc/tree-ssa-loop-ivopts.cc
+++ b/gcc/tree-ssa-loop-ivopts.cc
@@ -1190,19 +1190,11 @@ alloc_iv (struct ivopts_data *data, tree base, tree 
step,
                                              sizeof (struct iv));
   gcc_assert (step != NULL_TREE);
 
-  /* Lower address expression in base except ones with DECL_P as operand.
-     By doing this:
-       1) More accurate cost can be computed for address expressions;
-       2) Duplicate candidates won't be created for bases in different
-         forms, like &a[0] and &a.  */
+  aff_tree comb;
   STRIP_NOPS (expr);
-  if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
-      || contain_complex_addr_expr (expr))
-    {
-      aff_tree comb;
-      tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
-      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
-    }
+  expr = fold_convert (unsigned_type_for (TREE_TYPE (expr)), expr);
+  tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
+  base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
 
   iv->base = base;
   iv->base_object = determine_base_object (data, base);




--

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
index fd24e38a53e9f3a4659dece5af4d71fbc0ce2c18..3c9c2e5deed1ba755d1fae15a6553d68b6ad0098 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr64705.c
@@ -23,4 +23,4 @@ int foo(char *flags, long len, long i, long steps)
 
 /* Don't expand iv {base+step, step}_loop into {base+x+y, step}_loop
    even if "step == x + y".  */
-/* { dg-final { scan-tree-dump "Base:\\tstep_\[0-9\]* \\+ iter|Base:\\titer_\[0-9\]* \\+ step" "ivopts"} } */
+/* { dg-final { scan-tree-dump {Base:\t\(unsigned ?.*\) step_[0-9]* \+ \(unsigned ?.*\) iter|Base:\t\(unsigned ?.*\) iter_[0-9]* \+ \(unsigned ?.*\) step} "ivopts"} } */
diff --git a/gcc/testsuite/gcc.target/i386/pr115462.c b/gcc/testsuite/gcc.target/i386/pr115462.c
index ad50a6382bc4bed3e24a281663f50d24b989560d..6526f99bf0e3849c0f633932f24bd7e93d737f46 100644
--- a/gcc/testsuite/gcc.target/i386/pr115462.c
+++ b/gcc/testsuite/gcc.target/i386/pr115462.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mavx2 -fno-tree-vectorize -fno-pic" } */
-/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, p1\.0\+[0-9]*\(,} 3 } } */
+/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, p1\.0\+[0-9]*\(,} 1 } } */
 
 int
 foo (long indx, long indx2, long indx3, long indx4, long indx5, long indx6, long n, int* q)
diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
new file mode 100644
index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
@@ -0,0 +1,37 @@
+! { dg-do compile { target aarch64-*-* } }
+! { dg-additional-options "-w -Ofast" }
+
+  module brute_force
+    integer, parameter :: r=9
+     integer  block(r, r, 0)
+    contains
+  subroutine brute
+     do
+      do
+          do
+           do
+                do
+                     do
+                         do i7 = l0, 1
+                       select case(1 )
+                       case(1)
+                           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
+                       end select
+                            do i8 = 1, 1
+                               do i9 = 1, 1
+                            if(1 == 1) then
+                                    call digits_20
+                                end if
+                                end do
+                          end do
+                    end do
+                    end do
+              end do
+              end do
+           end do
+     end do
+  end do
+ end
+  end
+
+! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
index bf4a865bdfe7752a161fecea8b10d0ffb54c9833..4772af4a52a7f0a0662ce7bf0b2a655b02b6418d 100644
--- a/gcc/tree-ssa-loop-ivopts.cc
+++ b/gcc/tree-ssa-loop-ivopts.cc
@@ -1190,19 +1190,11 @@ alloc_iv (struct ivopts_data *data, tree base, tree step,
 					      sizeof (struct iv));
   gcc_assert (step != NULL_TREE);
 
-  /* Lower address expression in base except ones with DECL_P as operand.
-     By doing this:
-       1) More accurate cost can be computed for address expressions;
-       2) Duplicate candidates won't be created for bases in different
-	  forms, like &a[0] and &a.  */
+  aff_tree comb;
   STRIP_NOPS (expr);
-  if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
-      || contain_complex_addr_expr (expr))
-    {
-      aff_tree comb;
-      tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
-      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
-    }
+  expr = fold_convert (unsigned_type_for (TREE_TYPE (expr)), expr);
+  tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
+  base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
 
   iv->base = base;
   iv->base_object = determine_base_object (data, base);

[PATCH][ivopts]: perform affine fold to unsigned on non address expressions. [PR114932]

Reply via email to