This patch attempts to fix a libgcc codegen regression introduced in
gcc-10, as -ftree-loop-distribute-patterns was enabled at -O2.


The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.

Though short constant-length memsets are usually dealt with
efficiently, for non-constant-length ones, the options are setmemM, or
a function calls.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch adds to the loop distribution pass some cost analysis based
on preexisting *_RATIO macros, so that we won't transform loops with
trip counts as low as the ratios we'd rather expand inline.


This patch is not finished; it needs adjustments to the testsuite, to
make up for the behavior changes it brings about.  Specifically, on a
x86_64-linux-gnu regstrap, it regresses:

> FAIL: gcc.dg/pr53265.c  (test for warnings, line 40)
> FAIL: gcc.dg/pr53265.c  (test for warnings, line 42)
> FAIL: gcc.dg/tree-ssa/ldist-38.c scan-tree-dump ldist "split to 0 loops and 1 
> library cal> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++14  scan-tree-dump 
> ldist "split to 0 loops and 1 library calls"
> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++17  scan-tree-dump ldist "split to 
> 0 loops and 1 library calls"
> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++2a  scan-tree-dump ldist "split to 
> 0 loops and 1 library calls"

I suppose just lengthening the loops will take care of ldist-38 and
pr78847, but the loss of the warnings in pr53265 is more concerning, and
will require investigation.

Nevertheless, I seek feedback on whether this is an acceptable approach,
or whether we should use alternate tuning parameters for ldist, or
something entirely different.  Thanks in advance,


for  gcc/ChangeLog

        * tree-loop-distribution.c (maybe_normalize_partition): New.
        (loop_distribution::distribute_loop): Call it.

[requires testsuite adjustments and investigation of a warning regression]
---
 gcc/tree-loop-distribution.c |   54 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index bb15fd3723fb6..b5198652817ee 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -2848,6 +2848,52 @@ fuse_memset_builtins (vec<struct partition *> 
*partitions)
     }
 }
 
+/* Return false if it's profitable to turn the LOOP PARTITION into a builtin
+   call, and true if it wasn't, changing the PARTITION to PKIND_NORMAL.  */
+
+static bool
+maybe_normalize_partition (class loop *loop, struct partition *partition)
+{
+  unsigned HOST_WIDE_INT ratio;
+
+  switch (partition->kind)
+    {
+    case PKIND_NORMAL:
+    case PKIND_PARTIAL_MEMSET:
+      return false;
+
+    case PKIND_MEMSET:
+      if (integer_zerop (gimple_assign_rhs1 (DR_STMT
+                                            (partition->builtin->dst_dr))))
+       ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
+      else
+       ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
+      break;
+
+    case PKIND_MEMCPY:
+    case PKIND_MEMMOVE:
+      ratio = MOVE_RATIO (optimize_loop_for_speed_p (loop));
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  tree niters = number_of_latch_executions (loop);
+  if (niters == NULL_TREE || niters == chrec_dont_know)
+    return false;
+
+  wide_int minit, maxit;
+  value_range_kind vrk = determine_value_range (niters, &minit, &maxit);
+  if (vrk == VR_RANGE && wi::ltu_p (maxit, ratio))
+    {
+      partition->kind = PKIND_NORMAL;
+      return true;
+    }
+
+  return false;
+}
+
 void
 loop_distribution::finalize_partitions (class loop *loop,
                                        vec<struct partition *> *partitions,
@@ -3087,6 +3133,14 @@ loop_distribution::distribute_loop (class loop *loop, 
vec<gimple *> stmts,
     }
 
   finalize_partitions (loop, &partitions, &alias_ddrs);
+  {
+    bool any_changes_p = false;
+    for (i = 0; partitions.iterate (i, &partition); ++i)
+      if (maybe_normalize_partition (loop, partition))
+       any_changes_p = true;
+    if (any_changes_p)
+      finalize_partitions (loop, &partitions, &alias_ddrs);
+  }
 
   /* If there is a reduction in all partitions make sure the last one
      is not classified for builtin code generation.  */

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar

Reply via email to