On Mon, Dec 6, 2021 at 6:26 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
>
>
>
> On 2021/12/6 13:09, Xionghu Luo via Gcc-patches wrote:
> >
> >
> > On 2021/12/1 18:09, Richard Biener wrote:
> >> On Wed, Nov 10, 2021 at 4:08 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2021/11/4 21:00, Richard Biener wrote:
> >>>> On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luo...@linux.ibm.com> wrote:
> >>>>>
> >>>>>
> >>>>>> +  while (outmost_loop != loop)
> >>>>>> +    {
> >>>>>> +      if (bb_colder_than_loop_preheader (loop_preheader_edge
> >>>>>> (outmost_loop)->src,
> >>>>>> +                                        loop_preheader_edge 
> >>>>>> (cold_loop)->src))
> >>>>>> +       cold_loop = outmost_loop;
> >>>>>> +      outmost_loop = superloop_at_depth (loop, loop_depth 
> >>>>>> (outmost_loop) + 1);
> >>>>>> +    }
> >>>>>>
> >>>>>> could be instead written as
> >>>>>>
> >>>>>>   coldest_loop = coldest_outermost_loop[loop->num];
> >>>>>>   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> >>>>>>     return outermost_loop;
> >>>>>>   return coldest_loop;
> >>>>>>
> >>>>>> ?  And in the usual case coldest_outermost_loop[L] would be the loop 
> >>>>>> tree root.
> >>>>>> It should be possible to compute such cache in a DFS walk of the loop 
> >>>>>> tree
> >>>>>> (the loop iterator by default visits in such order).
> >>>>>
> >>>>>
> >>>>> Thanks.  Updated the patch with your suggestion.  Not sure whether it 
> >>>>> strictly
> >>>>> conforms to your comments.  Though the patch passed all my added 
> >>>>> tests(coverage not enough),
> >>>>> I am still a bit worried if pre-computed coldest_loop is outside of 
> >>>>> outermost_loop, but
> >>>>> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
> >>>>>
> >>>>>  [loop tree root, coldest_loop, outermost_loop,..., 
> >>>>> second_coldest_loop, ..., loop],
> >>>>>
> >>>>> then function find_coldest_out_loop will return a loop NOT accord with 
> >>>>> our
> >>>>> expectation, that should return second_coldest_loop instead of 
> >>>>> outermost_loop?
> >>>> Hmm, interesting - yes.  I guess the common case will be that the 
> >>>> pre-computed
> >>>> outermost loop will be the loop at depth 1 since outer loops tend to
> >>>> be colder than
> >>>> inner loops?  That would then defeat the whole exercise.
> >>>
> >>> It is not easy to construct such cases, But finally I got below results,
> >>>
> >>> 1) many cases inner loop is hotter than outer loop, for example:
> >>>
> >>> loop 1's coldest_outermost_loop is 1, colder_than_inner_loop is NULL
> >>> loop 2's coldest_outermost_loop is 1, colder_than_inner_loop is 1
> >>> loop 3's coldest_outermost_loop is 1, colder_than_inner_loop is 2
> >>> loop 4's coldest_outermost_loop is 1, colder_than_inner_loop is 2
> >>>
> >>>
> >>> 2) But there are also cases inner loop is colder than outer loop, like:
> >>>
> >>> loop 1's coldest outermost loop is 1, colder_than_inner_loop is NULL
> >>> loop 2's coldest outermost loop is 2, colder_than_inner_loop is NULL
> >>> loop 3's coldest outermost loop is 3, colder_than_inner_loop is NULL
> >>>
> >>>
> >>>>
> >>>> To optimize the common case but not avoiding iteration in the cases we 
> >>>> care
> >>>> about we could instead cache the next outermost loop that is _not_ colder
> >>>> than loop.  So for your [ ... ] example above we'd have> 
> >>>> hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
> >>>> candidate would then be 'second_coldest_loop' and we'd then iterate
> >>>> to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
> >>>> cold candidate we can compare against?  For the common case we'd
> >>>> have hotter_than_inner_loop[looo] == NULL (no such loop) and we then
> >>>> simply pick 'outermost_loop'.
> >>>
> >>> Thanks.  It was difficult to understand, but finally I got to know what 
> >>> you
> >>> want to express :)
> >>>
> >>> We should cache the next loop that is *colder* than loop instead of 
> >>> '_not_ colder
> >>> than loop', and 'hotter_than_inner_loop' should be 
> >>> 'colder_than_inner_loop',
> >>> then it makes sense if the coldest loop is outside of outermost loop, 
> >>> continue to
> >>> find a colder loop between outermost loop and current loop in
> >>> colder_than_inner_loop[loop->num]?  Hope I understood you correctly...
> >>
> >> Heh, looking at the patch - I don't know.
> >>
> >> To make the calls to bb_colder_than_loop_preheader more obvious can you
> >> change that function to
> >>
> >> +bool bb_colder_than_loop_preheader (basic_block bb, loop *loop)
> >> +{
> >> +  return bb->count < loop_preheader_edge (loop)->src->count;
> >>
> >> ?
> >>
> >> +      if (colder_than_inner_loop[loop->num] != NULL
> >> +         && loop_depth (outermost_loop)
> >> +              < loop_depth (colder_than_inner_loop[loop->num]))
> >> +       return colder_than_inner_loop[loop->num];
> >>
> >> that is
> >>
> >> +  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is 
> >> a
> >> +  colder loop between OUTERMOST_LOOP and loop in pre-computed
> >> +  COLDER_THAN_INNER_LOOP, return it, otherwise return OUTERMOST_LOOP.  */
> >>
> >> since we index colder_than_inner_loop by loop->num as well this doesn't 
> >> find the
> >> coldest outer loop of 'loop' within the nest up to outermost_loop,
> >> correct?  For the
> >> "common" case of each successive outer loop being colder than the immediate
> >> inner loop doesn't that result in only hoisting one level in case the
> >> coldest outer loop
> >> is outside of the outermost loop we can legally hoist to?
> >>
> >> My suggestion to cache the next outer "hotter" loop was to make the search 
> >> for
> >> the optimal hoisting position O(1) for the common case (there is no
> >> hotter outer loop),
> >> allowing to do
> >>
> >>    hot_loop = next_hot_loop[loop->num];
> >>    if (!hot_loop || loop_depth (hot_loop) < loop_depth (outermost_loop))
> >>      return outermost_loop;
> >>
> >> and do the linear search in the other uncommon case (eventually the cache
> >> can be exploited to optimize that search as well).
> >>
> >> The cache should guarantee that for loops between next_hot_loop[loop->num]
> >> and loop each loop is colder than its inner loop, so the immediate inner 
> >> loop
> >> of 'hot_loop' is the coldest place in the sub-nest.
> >>
> >> Basically the cache should make the common case O(1) (well, I _do_ hope
> >> that is the common case ;))
> >>
> >> Btw, can you make the cache array a vec<class loop> * please so we get
> >> index checking?
> >>
> >
> > Thanks, updated with your comments to use hotter_than_inner_loop instead of
> > colder_than_inner_loop, it works.
> >
> > For typical case like below:
> >
> > for (int j = 0; j < m; j++) // Loop 1
> >   if (__builtin_expect (y, 0))
> >    for (int j2 = 0; j2 < 256; j2++) // Loop 2
> >       for (int i = 0; i < m; i++) // Loop 3
> >         for (int s = 0; s < m; s++) // Loop 4
> >           for (int s1 = 0; s1 < m; s1++) // Loop 5
> >             if (__builtin_expect (y, 0))
> >               for (int s2 = 0; s2 < m; s2++) // Loop 6
> >
> >
> > The output is:
> >
> > loop 1's coldest_outermost_loop is 1, hotter_than_inner_loop is NULL
> > loop 2's coldest_outermost_loop is 2, hotter_than_inner_loop is 1
> > loop 3's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> > loop 4's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> > loop 5's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> > loop 6's coldest_outermost_loop is 2, hotter_than_inner_loop is 5
> >
> >
> >
> > v8-0002-Don-t-move-cold-code-out-of-loop-by-checking-bb-c.patch
> >
> >
> > From: Xiong Hu Luo <luo...@linux.ibm.com>
> >
> > v8 changes:
> > 1. Use hotter_than_inner_loop instead of colder to store a hotter loop
> > nearest to loop.
> > 2. Update the logic in fill_coldest_and_hotter_out_loop and
> > get_coldest_out_loop to make common case O(1).
> > 3. Update function argument bb_colder_than_loop_preheader.
> > 4. Make cached array to vec<class *loop> for index checking.
> >
> > v7 changes:
> > 1. Refine get_coldest_out_loop to replace loop with checking
> > pre-computed coldest_outermost_loop and colder_than_inner_loop.
> > 2. Add function fill_cold_out_loop, compute coldest_outermost_loop and
> > colder_than_inner_loop recursively without loop.
> >
> > v6 changes:
> > 1. Add function fill_coldest_out_loop to pre compute the coldest
> > outermost loop for each loop.
> > 2. Rename find_coldest_out_loop to get_coldest_out_loop.
> > 3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.
> >
> > v5 changes:
> > 1. Refine comments for new functions.
> > 2. Use basic_block instead of count in bb_colder_than_loop_preheader
> > to align with function name.
> > 3. Refine with simpler implementation for get_coldest_out_loop and
> > ref_in_loop_hot_body::operator for better understanding.
> >
> > v4 changes:
> > 1. Sort out profile_count comparision to function 
> > bb_cold_than_loop_preheader.
> > 2. Update ref_in_loop_hot_body::operator () to find cold_loop before 
> > compare.
> > 3. Split RTL invariant motion part out.
> > 4. Remove aux changes.
> >
> > v3 changes:
> > 1. Handle max_loop in determine_max_movement instead of 
> > outermost_invariant_loop.
> > 2. Remove unnecessary changes.
> > 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in 
> > can_sm_ref_p.
> > 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> > infinite loop when implementing v1 and the iteration is missed to be
> > updated actually.
> >
> > v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> > v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> > v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
> > v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
> > v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
> >
> > There was a patch trying to avoid move cold block out of loop:
> >
> > https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >
> > Richard suggested to "never hoist anything from a bb with lower execution
> > frequency to a bb with higher one in LIM invariantness_dom_walker
> > before_dom_children".
> >
> > In gimple LIM analysis, add get_coldest_out_loop to move invariants to
> > expected target loop, if profile count of the loop bb is colder
> > than target loop preheader, it won't be hoisted out of loop.
> > Likely for store motion, if all locations of the REF in loop is cold,
> > don't do store motion of it.
> >
> > SPEC2017 performance evaluation shows 1% performance improvement for
> > intrate GEOMEAN and no obvious regression for others.  Especially,
> > 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> > largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> > on P8LE.
> >
> > gcc/ChangeLog:
> >
> >       * tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
> >       function.
> >       (get_coldest_out_loop): New function.
> >       (determine_max_movement): Use get_coldest_out_loop.
> >       (move_computations_worker): Adjust and fix iteration udpate.
> >       (class ref_in_loop_hot_body): New functor.
> >       (ref_in_loop_hot_body::operator): New.
> >       (can_sm_ref_p): Use for_all_locs_in_loop.
> >       (fill_coldest_and_hotter_out_loop): New.
> >       (tree_ssa_lim_finalize): Free coldest_outermost_loop and
> >       hotter_than_inner_loop.
> >       (loop_invariant_motion_in_fun): Call fill_coldet_and_hotter_out_loop.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.dg/tree-ssa/recip-3.c: Adjust.
> >       * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> >       * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> >       * gcc.dg/tree-ssa/ssa-lim-21.c: New test.
> >       * gcc.dg/tree-ssa/ssa-lim-22.c: New test.
> >       * gcc.dg/tree-ssa/ssa-lim-23.c: New test.
> > ---
> >  gcc/tree-ssa-loop-im.c                     | 150 ++++++++++++++++++++-
> >  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 ++++
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 ++++
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 +++++
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 +++++
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c |  21 +++
> >  7 files changed, 291 insertions(+), 3 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c
> >
> > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> > index 4b187c2cdaf..0bcc3bca91b 100644
> > --- a/gcc/tree-ssa-loop-im.c
> > +++ b/gcc/tree-ssa-loop-im.c
> > @@ -146,6 +146,11 @@ public:
> >  enum dep_kind { lim_raw, sm_war, sm_waw };
> >  enum dep_state { dep_unknown, dep_independent, dep_dependent };
> >
> > +/* coldest outermost loop for given loop.  */
> > +vec<class loop *> coldest_outermost_loop;
> > +/* hotter outer loop nearest to given loop.  */
> > +vec<class loop *> hotter_than_inner_loop;
> > +
> >  /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
> >
> >  static void
> > @@ -417,6 +422,61 @@ movement_possibility (gimple *stmt)
> >    return ret;
> >  }
> >
> > +/* Compare the profile count inequality of bb and loop's preheader, it is
> > +   three-state as stated in profile-count.h, FALSE is returned if 
> > inequality
> > +   cannot be decided.  */
> > +bool
> > +bb_colder_than_loop_preheader (basic_block bb, class loop *loop)
> > +{
> > +  gcc_assert (bb && loop);
> > +  return bb->count < loop_preheader_edge (loop)->src->count;
> > +}
> > +
> > +/* Check coldest loop between OUTERMOST_LOOP and LOOP by comparing profile
> > +   count.
> > +  It does three steps check:
> > +  1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, 
> > just
> > +  return NULL which means it should not be moved out at all;
> > +  2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of
> > +  OUTERMOST_LOOP, if it is inside of OUTERMOST_LOOP, return the 
> > COLDEST_LOOP;
> > +  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
> > +  hotter loop between OUTERMOST_LOOP and loop in pre-computed
> > +  HOTTER_THAN_INNER_LOOP, return it's nested inner loop, otherwise return
> > +  OUTERMOST_LOOP.  */
> > +
> > +static class loop *
> > +get_coldest_out_loop (class loop *outermost_loop, class loop *loop,
> > +                   basic_block curr_bb)
> > +{
> > +  gcc_assert (outermost_loop == loop
> > +           || flow_loop_nested_p (outermost_loop, loop));
> > +
> > +  /* If bb_colder_than_loop_preheader returns false due to three-state
> > +    comparision, OUTERMOST_LOOP is returned finally to preserve the 
> > behavior.
> > +    Otherwise, return the coldest loop between OUTERMOST_LOOP and LOOP.  */
> > +  if (curr_bb && bb_colder_than_loop_preheader (curr_bb, loop))
> > +    return NULL;
> > +
> > +  class loop *coldest_loop = coldest_outermost_loop[loop->num];
> > +  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> > +    {
> > +      class loop *hotter_loop = hotter_than_inner_loop[loop->num];
> > +      if (!hotter_loop
> > +       || loop_depth (hotter_loop) < loop_depth (outermost_loop))
> > +     return outermost_loop;
> > +
> > +      /*  hotter_loop is between OUTERMOST_LOOP and LOOP like:
> > +     [loop tree root, ..., coldest_loop, ..., outermost_loop, ...,
> > +     hotter_loop, second_coldest_loop, ..., loop]
> > +     return second_coldest_loop to be the hoist target.  */
> > +      class loop *aloop;
> > +      for (aloop = hotter_loop->inner; aloop; aloop = aloop->next)
> > +     if (flow_loop_nested_p (aloop, loop))
>
> should be:
>
>         if (aloop == loop || flow_loop_nested_p (aloop, loop))

OK with that fixed.

Are necessary prerequesites committed to avoid regressions?
I guess we need to keep a watchful eye and eventually revert
(or gate with a --param disabled by default) the new behavior if
severe regressions are discovered.

Thanks and sorry for the repeated delays.
Richard.

>
> > +       return aloop;
> > +    }
> > +  return coldest_loop;
> > +}
> > +
>
>
> --
> Thanks,
> Xionghu

Reply via email to