On Tue, Sep 29, 2020 at 1:30 PM Kewen.Lin <li...@linux.ibm.com> wrote:
>
> Hi,
>
> As the discussion in PR96789, we found that some scalar stmts
> which can be eliminated by some passes after SLP, but we still
> modeled their costs when trying to SLP, it could impact
> vectorizer's decision.  One typical case is the case in PR on
> target Power.
>
> As Richard suggested there, this patch is to introduce one pass
> called scalar_cleanup which has some secondary clean up passes,
> for now they are FRE and DSE.  It's only triggered when seeing
> one TODO flag called TODO_force_next_scalar_cleanup set, unlike
> normal TODO flags, the flag is kept in one global variable
> pending_TODOs and expects its downstream consumer to handle it.
>
> I verified that it can get x264's runtime performance back on
> Power, I also evaluated the compilation time for the SPEC2017
> int benchmarks build with one single job, this patch increase
> the compilation time by 0.74%.
>
> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>
> Is it ok for trunk?

diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index 298ab215530..7016f993339 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -1605,6 +1605,14 @@ pass_complete_unroll::execute (function *fun)
     peeled_loops = BITMAP_ALLOC (NULL);
   unsigned int val = tree_unroll_loops_completely (flag_cunroll_grow_size,
                                                   true);
+
+  /* There are no loops after unrolling, we assume that it's not so costly
+     to do the scalar cleanup since here.  FIXME: Some heuristics can be
+     further added to guard the cost level, like nodes number total, all
+     the original loops should be with single exits, etc.  */
+  if (!current_loops->tree_root->inner)
+    val |= TODO_force_next_scalar_cleanup;
+

so this is not the appropriate way to guard this.  Instead in

static unsigned int
tree_unroll_loops_completely (bool may_increase_size, bool unroll_outer)
{

look where we do

          bitmap fathers = BITMAP_ALLOC (NULL);
          EXECUTE_IF_SET_IN_BITMAP (father_bbs, 0, i, bi)
            {
              basic_block unrolled_loop_bb = BASIC_BLOCK_FOR_FN (cfun, i);
              if (! unrolled_loop_bb)
                continue;
              if (loop_outer (unrolled_loop_bb->loop_father))
                bitmap_set_bit (fathers,
                                unrolled_loop_bb->loop_father->num);

and in the else case return TODO_force_next_scalar_cleanup because
then we know we have unrolled an outermost loop and not ran VN
immediately on it.

+/* Scalar cleanup, it has several gated cleanup passes like FRE, DSE.  */
+
+namespace {
+
+const pass_data pass_data_scalar_cleanup =
+{
+  GIMPLE_PASS, /* type */
+  "*scalar_cleanup", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */

this new "pass" doesn't have to do anything with tree-ssa-loop-ivcanon.c
so please add it to passes.c instead (there's already a bunch of
pass definitions in there).

Can you repeat the compile-time measurement there?  I also wonder
whether we should worry about compile-time at -O[12] when SLP is not run.
Thus, probably rename the cleanup pass to pre_slp_scalar_cleanup and
gate it on && flag_slp_vectorize

Note this is probably the cleanest way to implement this hack.  But it
still is what it is - a hack.  Not a proper fix for whatever the actual issue is
which means I'm not the one that's going to ack it (since I've suggested it).

Thanks,
Richard.

> BR,
> Kewen
> -----------
> gcc/ChangeLog:
>
>         PR tree-optimization/96789
>         * passes.c (execute_one_pass): Add support for
>         TODO_force_next_scalar_cleanup.
>         (pending_TODOs): Init.
>         * passes.def (pass_scalar_cleanup): New pass, add pass_fre and
>         pass_dse as its children.
>         * timevar.def (TV_SCALAR_CLEANUP): New timevar.
>         * tree-pass.h (TODO_force_next_scalar_cleanup): New TODO flag.
>         (make_pass_scalar_cleanup): New function declare.
>         (pending_TODOs): New variable declare.
>         * tree-ssa-loop-ivcanon.c (pass_complete_unroll::execute): Set
>         TODO_force_next_scalar_cleanup if there are no loops.
>         (class pass_scalar_cleanup): New class.
>         (pass_data_scalar_cleanup): New pass data.
>         (make_pass_scalar_cleanup): New function.
>
> gcc/testsuite/ChangeLog:
>
>         PR tree-optimization/96789
>         * gcc.dg/tree-ssa/ssa-dse-28.c: Adjust.
>         * gcc.dg/tree-ssa/ssa-dse-29.c: Likewise.
>         * gcc.dg/tree-ssa/pr96789.c: New test.

Reply via email to