On Tue, Sep 29, 2020 at 1:30 PM Kewen.Lin <li...@linux.ibm.com> wrote: > > Hi, > > As the discussion in PR96789, we found that some scalar stmts > which can be eliminated by some passes after SLP, but we still > modeled their costs when trying to SLP, it could impact > vectorizer's decision. One typical case is the case in PR on > target Power. > > As Richard suggested there, this patch is to introduce one pass > called scalar_cleanup which has some secondary clean up passes, > for now they are FRE and DSE. It's only triggered when seeing > one TODO flag called TODO_force_next_scalar_cleanup set, unlike > normal TODO flags, the flag is kept in one global variable > pending_TODOs and expects its downstream consumer to handle it. > > I verified that it can get x264's runtime performance back on > Power, I also evaluated the compilation time for the SPEC2017 > int benchmarks build with one single job, this patch increase > the compilation time by 0.74%. > > Bootstrapped/regtested on powerpc64le-linux-gnu P9. > > Is it ok for trunk?
diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c index 298ab215530..7016f993339 100644 --- a/gcc/tree-ssa-loop-ivcanon.c +++ b/gcc/tree-ssa-loop-ivcanon.c @@ -1605,6 +1605,14 @@ pass_complete_unroll::execute (function *fun) peeled_loops = BITMAP_ALLOC (NULL); unsigned int val = tree_unroll_loops_completely (flag_cunroll_grow_size, true); + + /* There are no loops after unrolling, we assume that it's not so costly + to do the scalar cleanup since here. FIXME: Some heuristics can be + further added to guard the cost level, like nodes number total, all + the original loops should be with single exits, etc. */ + if (!current_loops->tree_root->inner) + val |= TODO_force_next_scalar_cleanup; + so this is not the appropriate way to guard this. Instead in static unsigned int tree_unroll_loops_completely (bool may_increase_size, bool unroll_outer) { look where we do bitmap fathers = BITMAP_ALLOC (NULL); EXECUTE_IF_SET_IN_BITMAP (father_bbs, 0, i, bi) { basic_block unrolled_loop_bb = BASIC_BLOCK_FOR_FN (cfun, i); if (! unrolled_loop_bb) continue; if (loop_outer (unrolled_loop_bb->loop_father)) bitmap_set_bit (fathers, unrolled_loop_bb->loop_father->num); and in the else case return TODO_force_next_scalar_cleanup because then we know we have unrolled an outermost loop and not ran VN immediately on it. +/* Scalar cleanup, it has several gated cleanup passes like FRE, DSE. */ + +namespace { + +const pass_data pass_data_scalar_cleanup = +{ + GIMPLE_PASS, /* type */ + "*scalar_cleanup", /* name */ + OPTGROUP_LOOP, /* optinfo_flags */ this new "pass" doesn't have to do anything with tree-ssa-loop-ivcanon.c so please add it to passes.c instead (there's already a bunch of pass definitions in there). Can you repeat the compile-time measurement there? I also wonder whether we should worry about compile-time at -O[12] when SLP is not run. Thus, probably rename the cleanup pass to pre_slp_scalar_cleanup and gate it on && flag_slp_vectorize Note this is probably the cleanest way to implement this hack. But it still is what it is - a hack. Not a proper fix for whatever the actual issue is which means I'm not the one that's going to ack it (since I've suggested it). Thanks, Richard. > BR, > Kewen > ----------- > gcc/ChangeLog: > > PR tree-optimization/96789 > * passes.c (execute_one_pass): Add support for > TODO_force_next_scalar_cleanup. > (pending_TODOs): Init. > * passes.def (pass_scalar_cleanup): New pass, add pass_fre and > pass_dse as its children. > * timevar.def (TV_SCALAR_CLEANUP): New timevar. > * tree-pass.h (TODO_force_next_scalar_cleanup): New TODO flag. > (make_pass_scalar_cleanup): New function declare. > (pending_TODOs): New variable declare. > * tree-ssa-loop-ivcanon.c (pass_complete_unroll::execute): Set > TODO_force_next_scalar_cleanup if there are no loops. > (class pass_scalar_cleanup): New class. > (pass_data_scalar_cleanup): New pass data. > (make_pass_scalar_cleanup): New function. > > gcc/testsuite/ChangeLog: > > PR tree-optimization/96789 > * gcc.dg/tree-ssa/ssa-dse-28.c: Adjust. > * gcc.dg/tree-ssa/ssa-dse-29.c: Likewise. > * gcc.dg/tree-ssa/pr96789.c: New test.