Re: [RFC] Don't move cold code out of loop by checking bb count

Richard Biener via Gcc-patches Thu, 26 Aug 2021 04:33:49 -0700

On Tue, Aug 10, 2021 at 4:03 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
>
> Hi,
>
> On 2021/8/6 20:15, Richard Biener wrote:
> > On Mon, Aug 2, 2021 at 7:05 AM Xiong Hu Luo <luo...@linux.ibm.com> wrote:
> >>
> >> There was a patch trying to avoid move cold block out of loop:
> >>
> >> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >>
> >> Richard suggested to "never hoist anything from a bb with lower execution
> >> frequency to a bb with higher one in LIM invariantness_dom_walker
> >> before_dom_children".
> >>
> >> This patch does this profile count check in both gimple LIM
> >> move_computations_worker and RTL loop-invariant.c find_invariants_bb,
> >> if the loop bb is colder than loop preheader, don't hoist it out of
> >> loop.
> >>
> >> Also, the profile count in loop split pass should be corrected to avoid
> >> lim2 and lim4 mismatch behavior, currently, the new loop preheader 
> >> generated
> >> by loop_version is set to "[count: 0]:", then lim4 after lsplt pass will
> >> move statement out of loop unexpectely when lim2 didn't move it.  This
> >> change could fix regression on 544.nab_r from -1.55% to +0.46%.
> >>
> >> SPEC2017 performance evaluation shows 1% performance improvement for
> >> intrate GEOMEAN and no obvious regression for others.  Especially,
> >> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> >> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> >> on P8LE.
> >>
> >> Regression and bootstrap tested pass on P8LE, any comments?  Thanks.
> >
> > While I'm not familiar with the RTL invariant motion pass the patch there
> > looks reasonable.  Note that we should assess the profile quality
> > somehow - I'm not sure how to do that, CCed Honza for that.
>
> Thanks.
>
> >
> > For the GIMPLE part the patch looks quite complicated - but note it
> > probably has to be since LIM performs kind of a "CSE" on loads
> > (and stores for store-motion), so when there are multiple stmts
> > affected by a hoisting decision the biggest block count has to be
> > accounted.  Likewise when there are dependent stmts involved
> > that might include conditional stmts (a "PHI"), but the overall
> > cost should be looked at.
>
> Currently, The gimple code check two situations with the patch:
> 1) The statement or PHI‘s BB is *colder* then preheader, don't move it out
> of loop;
> 2) The statement or PHI's BB is *hotter* then preheader, but any of it's rhs
> couldn't be moved out of loop, also don't move it out of loop to avoid 
> definition
> not dominates use error.


But part 2) is obviously already done.  What I tried to say is your heuristic
doesn't integrate nicely with the pass but I admitted that it might be a bit
difficult to find a place to add this heuristic.

There is lim_data->cost which we could bias negatively but then this is
a cost that is independent on the hoisting distance.  But doing this would
work at least for the case where the immediately enclosing loop preheader
is hotter than the stmt and with this it would be a patch that's similarly
simple as the RTL one.

Another possibility is to simply only adjust PHI processing in
compute_invariantness, capping movement according to the hotness
heuristic.  The same could be done for regular stmts there but I'm
not sure that will do good in the end since this function is supposed
to compute "correctness" (well, it also has the cost stuff), and it's
not the place to do overall cost considerations.

> May be I could collect the number of instructions not hoisted with the patch
> on regression tests and SPEC2017 to do a estimation for "multiple stmts 
> affected"
> and "overall cost" need to be considered?  But it seems 
> move_computations_worker
> couldn't rollback if we still want to hoist multiple stmts out during the 
> iterations?
>
> >
> > Now - GIMPLE LIM "costing" is somewhat backward right now
> > and it isn't set up to consider those multiple involved stmts.  Plus
> > the store-motion part does not have any cost part (but it depends
> > on previously decided invariant motions).
> >
> > I think the way you implemented the check will cause no hoisting
> > to be performed instead of, say, hoisting to a different loop level
> > only.  Possibly shown when you consider a loop nest like
> >
> >    for (;;)
> >      if (unlikely_cond)
> >        for (;;)
> >           invariant;
> >
> > we want to hoist 'invariant' but only from the inner loop even if it
> > is invariant also in the outer loop.
>
>
> For this case, theorotically I think the master GCC will optimize it to:
>
>   invariant;
>   for (;;)
>     if (unlikely_cond)
>       for (;;)
>          ;
>
> 'invariant' is moved out of outer loop, but with the patch, it will get:
>
>   for (;;)
>     if (unlikely_cond)
>       {
>         invariant;
>         for (;;)
>            ;
>       }
>
> 'invariant' is *cold* for outer loop, but it is still *hot* for inner loop,
> so hoist it out of inner loop, this is exactly what we want, right?

Yes.  I had doubts your patch would achieve that.

>
> >  But for example if there is
> > a store motion opportunity like
> >
> >    for (;;)
> >       {
> >          if (unlikely_cond)
> >            for (;;)
> >              a = ...;
> >          a = ...;
> >       }
> >
> > we'd still want to perform the store motion on the outer loop.
> >
> > Note that store-motion already performs part of the transform
> > before dependent code is moved in move_computations (that
> > you patched).
>
> Yes.  do_store_motion is running before move_computations_worker, store
> motion happens earlier in execute_sm, I also added the check in execute_sm
> to stop cold store moved out of loop.  So for your case, I think my patch
> will similarly optimize it to:
>
>   for (;;)
>      {
>         if (unlikely_cond)
>           {
>             for (;;)
>               ;
>             a = ...;
>           }
>      }
>     a = ...;
>
> Whether this is better?  Will construct cases to verify it.
>
> >
> > IIRC your main concern were the COND_EXPRs we insert
> > for hoisted conditional stmts?
>
> Not sure what you mean here of COND_EXPRs?

The PHIs we hoist and for which we insert COND_EXPRs.  IIRC that was
your original complaint.  So my question is whether we can fix the really
bad cases moving PHIs with sth local to compute_invariantness.

Richard.

>
> Thanks,
> Xionghu
>
> >
> > Thanks,
> > Richard.
> >
> >> gcc/ChangeLog:
> >>
> >>          * loop-invariant.c (find_invariants_bb): Check profile count
> >>          before motion.
> >>          (find_invariants_body): Add argument.
> >>          * tree-ssa-loop-im.c (move_computations_worker): Check profile
> >>          count before motion.
> >>          (execute_sm): Likewise.
> >>          (execute_sm_exit): Check pointer validness.
> >>          * tree-ssa-loop-split.c (split_loop): Correct probability.
> >>          (do_split_loop_on_cond): Likewise.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >>          * gcc.dg/tree-ssa/recip-3.c: Adjust.
> >> ---
> >>   gcc/loop-invariant.c                    |  10 +-
> >>   gcc/testsuite/gcc.dg/tree-ssa/recip-3.c |   2 +-
> >>   gcc/tree-ssa-loop-im.c                  | 164 +++++++++++++++++++++++-
> >>   gcc/tree-ssa-loop-split.c               |  14 +-
> >>   4 files changed, 177 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> >> index bdc7b59dd5f..7b5d64d11f9 100644
> >> --- a/gcc/loop-invariant.c
> >> +++ b/gcc/loop-invariant.c
> >> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool 
> >> always_reached, bool always_executed)
> >>      call.  */
> >>
> >>   static void
> >> -find_invariants_bb (basic_block bb, bool always_reached, bool 
> >> always_executed)
> >> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
> >> +                   bool always_executed)
> >>   {
> >>     rtx_insn *insn;
> >> +  basic_block preheader = loop_preheader_edge (loop)->src;
> >> +
> >> +  if (preheader->count > bb->count)
> >> +    return;
> >>
> >>     FOR_BB_INSNS (bb, insn)
> >>       {
> >> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block 
> >> *body,
> >>     unsigned i;
> >>
> >>     for (i = 0; i < loop->num_nodes; i++)
> >> -    find_invariants_bb (body[i],
> >> -                       bitmap_bit_p (always_reached, i),
> >> +    find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i),
> >>                          bitmap_bit_p (always_executed, i));
> >>   }
> >>
> >> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c 
> >> b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> >> index 638bf38db8c..641c91e719e 100644
> >> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> >> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> >> @@ -23,4 +23,4 @@ float h ()
> >>          F[0] += E / d;
> >>   }
> >>
> >> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> >> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> >> index 7de47edbcb3..2bfb5e8ec15 100644
> >> --- a/gcc/tree-ssa-loop-im.c
> >> +++ b/gcc/tree-ssa-loop-im.c
> >> @@ -1147,6 +1147,61 @@ move_computations_worker (basic_block bb)
> >>            continue;
> >>          }
> >>
> >> +      edge e = loop_preheader_edge (level);
> >> +      if (e->src->count > bb->count)
> >> +       {
> >> +         if (dump_file && (dump_flags & TDF_DETAILS))
> >> +           {
> >> +             fprintf (dump_file, "PHI node NOT moved to %d from %d:\n",
> >> +                      e->src->index, bb->index);
> >> +             print_gimple_stmt (dump_file, stmt, 0);
> >> +             fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
> >> +                      level->num);
> >> +           }
> >> +         gsi_next (&bsi);
> >> +         continue;
> >> +       }
> >> +      else
> >> +       {
> >> +         unsigned i;
> >> +         bool skip_phi_move = false;
> >> +         for (i = 0; i < gimple_phi_num_args (stmt); i++)
> >> +           {
> >> +             tree def = PHI_ARG_DEF (stmt, i);
> >> +
> >> +             if (TREE_CODE (def) != SSA_NAME)
> >> +               continue;
> >> +
> >> +             gimple *def_stmt = SSA_NAME_DEF_STMT (def);
> >> +
> >> +             if (!gimple_bb (def_stmt))
> >> +               continue;
> >> +
> >> +             if (!dominated_by_p (CDI_DOMINATORS, e->src,
> >> +                                  gimple_bb (def_stmt)))
> >> +               {
> >> +                 if (dump_file && (dump_flags & TDF_DETAILS))
> >> +                   {
> >> +                     fprintf (dump_file,
> >> +                              "PHI node NOT moved to %d [local count:%d] 
> >> from "
> >> +                              "%d [local count:%d]:\n",
> >> +                              e->src->index, e->src->count.value (), 
> >> bb->index,
> >> +                              bb->count.value ());
> >> +                     print_gimple_stmt (dump_file, stmt, 0);
> >> +                     fprintf (dump_file, "(cost %u) out of loop %d.\n\n", 
> >> cost,
> >> +                              level->num);
> >> +                   }
> >> +                 skip_phi_move = true;
> >> +                 break;
> >> +               }
> >> +           }
> >> +         if (skip_phi_move)
> >> +           {
> >> +             gsi_next (&bsi);
> >> +             continue;
> >> +           }
> >> +       }
> >> +
> >>         if (dump_file && (dump_flags & TDF_DETAILS))
> >>          {
> >>            fprintf (dump_file, "Moving PHI node\n");
> >> @@ -1184,14 +1239,13 @@ move_computations_worker (basic_block bb)
> >>            tree lhs = gimple_assign_lhs (new_stmt);
> >>            SSA_NAME_RANGE_INFO (lhs) = NULL;
> >>          }
> >> -      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
> >> +      gsi_insert_on_edge (e, new_stmt);
> >>         remove_phi_node (&bsi, false);
> >>       }
> >>
> >>     for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
> >>       {
> >>         edge e;
> >> -
> >>         gimple *stmt = gsi_stmt (bsi);
> >>
> >>         lim_data = get_lim_data (stmt);
> >> @@ -1214,7 +1268,90 @@ move_computations_worker (basic_block bb)
> >>         /* We do not really want to move conditionals out of the loop; we 
> >> just
> >>           placed it here to force its operands to be moved if necessary.  
> >> */
> >>         if (gimple_code (stmt) == GIMPLE_COND)
> >> -       continue;
> >> +       {
> >> +         gsi_next (&bsi);
> >> +         continue;
> >> +       }
> >> +
> >> +      e = loop_preheader_edge (level);
> >> +      if (e->src->count > bb->count)
> >> +       {
> >> +         if (dump_file && (dump_flags & TDF_DETAILS))
> >> +           {
> >> +             fprintf (dump_file,
> >> +                      "stmt: Statement NOT moved to %d [local count:%d] 
> >> from "
> >> +                      "%d [local count:%d]:\n",
> >> +                      e->src->index, e->src->count.value (), bb->index,
> >> +                      bb->count.value ());
> >> +             print_gimple_stmt (dump_file, stmt, 0);
> >> +             fprintf (dump_file, "(cost %u) out of loop %d.\n\n", cost,
> >> +                      level->num);
> >> +           }
> >> +         gsi_next (&bsi);
> >> +         continue;
> >> +       }
> >> +      else
> >> +       {
> >> +         if (is_gimple_assign (stmt))
> >> +           {
> >> +             tree rhs1 = gimple_assign_rhs1 (stmt);
> >> +             tree rhs2 = gimple_assign_rhs2 (stmt);
> >> +             if (TREE_CODE (rhs1) == MEM_REF)
> >> +               {
> >> +                 rhs2 = TREE_OPERAND (rhs1, 1);
> >> +                 rhs1 = TREE_OPERAND (rhs1, 0);
> >> +               }
> >> +             gimple *stmt1 = NULL, *stmt2 = NULL;
> >> +             basic_block def_bb;
> >> +             if (rhs1 && TREE_CODE (rhs1) == SSA_NAME)
> >> +               {
> >> +                 stmt1 = SSA_NAME_DEF_STMT (rhs1);
> >> +                 def_bb = gimple_bb (stmt1);
> >> +                 if (stmt1
> >> +                     && def_bb
> >> +                     && (def_bb == bb
> >> +                         || !dominated_by_p (CDI_DOMINATORS, e->src, 
> >> def_bb)))
> >> +                   {
> >> +                     if (dump_file && (dump_flags & TDF_DETAILS))
> >> +                       {
> >> +                         fprintf (dump_file,
> >> +                                  "stmt1: Statement NOT moved to %d 
> >> [local "
> >> +                                  "count:%d] from %d [local count:%d]:\n",
> >> +                                  e->src->index, e->src->count.value (),
> >> +                                  bb->index, bb->count.value ());
> >> +                         print_gimple_stmt (dump_file, stmt, 0);
> >> +                         fprintf (dump_file, "(cost %u) out of loop 
> >> %d.\n\n",
> >> +                                  cost, level->num);
> >> +                       }
> >> +                     gsi_next (&bsi);
> >> +                     continue;
> >> +                   }
> >> +               }
> >> +             if (rhs2 && TREE_CODE (rhs2) == SSA_NAME)
> >> +               {
> >> +                 stmt2 = SSA_NAME_DEF_STMT (rhs2);
> >> +                 def_bb = gimple_bb (stmt2);
> >> +                 if (stmt2 && def_bb
> >> +                     && (def_bb == bb
> >> +                         || !dominated_by_p (CDI_DOMINATORS, e->src, 
> >> def_bb)))
> >> +                   {
> >> +                     if (dump_file && (dump_flags & TDF_DETAILS))
> >> +                       {
> >> +                         fprintf (dump_file,
> >> +                                  "stmt2: Statement NOT moved to %d 
> >> [local "
> >> +                                  "count:%d] from %d [local count:%d]:\n",
> >> +                                  e->src->index, e->src->count.value (),
> >> +                                  bb->index, bb->count.value ());
> >> +                         print_gimple_stmt (dump_file, stmt, 0);
> >> +                         fprintf (dump_file, "(cost %u) out of loop 
> >> %d.\n\n",
> >> +                                  cost, level->num);
> >> +                       }
> >> +                     gsi_next (&bsi);
> >> +                     continue;
> >> +                   }
> >> +               }
> >> +           }
> >> +       }
> >>
> >>         if (dump_file && (dump_flags & TDF_DETAILS))
> >>          {
> >> @@ -1224,7 +1361,6 @@ move_computations_worker (basic_block bb)
> >>                     cost, level->num);
> >>          }
> >>
> >> -      e = loop_preheader_edge (level);
> >>         gcc_assert (!gimple_vdef (stmt));
> >>         if (gimple_vuse (stmt))
> >>          {
> >> @@ -2094,6 +2230,19 @@ execute_sm (class loop *loop, im_mem_ref *ref,
> >>     bool multi_threaded_model_p = false;
> >>     gimple_stmt_iterator gsi;
> >>     sm_aux *aux = new sm_aux;
> >> +  basic_block bb = gimple_bb (first_mem_ref_loc (loop, ref)->stmt);
> >> +
> >> +  edge e = loop_preheader_edge (loop);
> >> +  if (e->src->count > bb->count)
> >> +    {
> >> +      if (dump_file && (dump_flags & TDF_DETAILS))
> >> +       {
> >> +         fprintf (dump_file, "Don't execute store motion of ");
> >> +         print_generic_expr (dump_file, ref->mem.ref);
> >> +         fprintf (dump_file, " from loop %d\n", loop->num);
> >> +       }
> >> +      return;
> >> +    }
> >>
> >>     if (dump_file && (dump_flags & TDF_DETAILS))
> >>       {
> >> @@ -2202,7 +2351,12 @@ execute_sm_exit (class loop *loop, edge ex, 
> >> vec<seq_entry> &seq,
> >>          }
> >>         else
> >>          {
> >> -         sm_aux *aux = *aux_map.get (ref);
> >> +         sm_aux **paux = aux_map.get (ref);
> >> +         sm_aux *aux;
> >> +         if (paux)
> >> +           aux = *paux;
> >> +         else
> >> +           continue;
> >>            if (!aux->store_flag || kind == sm_ord)
> >>              {
> >>                gassign *store;
> >> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
> >> index 3a09bbc39e5..4cae82936b9 100644
> >> --- a/gcc/tree-ssa-loop-split.c
> >> +++ b/gcc/tree-ssa-loop-split.c
> >> @@ -577,14 +577,17 @@ split_loop (class loop *loop1)
> >>          if (!initial_true)
> >>            cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond);
> >>
> >> +       edge true_edge = EDGE_SUCC (bbs[i], 0)->flags & EDGE_TRUE_VALUE
> >> +                          ? EDGE_SUCC (bbs[i], 0)
> >> +                          : EDGE_SUCC (bbs[i], 1);
> >>          /* Now version the loop, placing loop2 after loop1 connecting
> >>             them, and fix up SSA form for that.  */
> >>          initialize_original_copy_tables ();
> >>          basic_block cond_bb;
> >>
> >>          class loop *loop2 = loop_version (loop1, cond, &cond_bb,
> >> -                                          profile_probability::always (),
> >> -                                          profile_probability::always (),
> >> +                                          true_edge->probability,
> >> +                                          true_edge->probability.invert 
> >> (),
> >>                                             profile_probability::always (),
> >>                                             profile_probability::always (),
> >>                                             true);
> >> @@ -1486,8 +1489,8 @@ do_split_loop_on_cond (struct loop *loop1, edge 
> >> invar_branch)
> >>     initialize_original_copy_tables ();
> >>
> >>     struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL,
> >> -                                    profile_probability::always (),
> >> -                                    profile_probability::never (),
> >> +                                    invar_branch->probability,
> >> +                                    invar_branch->probability.invert (),
> >>                                       profile_probability::always (),
> >>                                       profile_probability::always (),
> >>                                       true);
> >> @@ -1530,6 +1533,9 @@ do_split_loop_on_cond (struct loop *loop1, edge 
> >> invar_branch)
> >>     to_loop1->flags |= true_invar ? EDGE_FALSE_VALUE : EDGE_TRUE_VALUE;
> >>     to_loop2->flags |= true_invar ? EDGE_TRUE_VALUE : EDGE_FALSE_VALUE;
> >>
> >> +  to_loop1->probability = invar_branch->probability.invert ();
> >> +  to_loop2->probability = invar_branch->probability;
> >> +
> >>     /* Due to introduction of a control flow edge from loop1 latch to loop2
> >>        pre-header, we should update PHIs in loop2 to reflect this 
> >> connection
> >>        between loop1 and loop2.  */
> >> --
> >> 2.27.0.90.geebb51ba8c
> >>

Re: [RFC] Don't move cold code out of loop by checking bb count

Reply via email to