On Mon, Dec 6, 2021 at 6:26 AM Xionghu Luo <luo...@linux.ibm.com> wrote: > > > > On 2021/12/6 13:09, Xionghu Luo via Gcc-patches wrote: > > > > > > On 2021/12/1 18:09, Richard Biener wrote: > >> On Wed, Nov 10, 2021 at 4:08 AM Xionghu Luo <luo...@linux.ibm.com> wrote: > >>> > >>> > >>> > >>> On 2021/11/4 21:00, Richard Biener wrote: > >>>> On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luo...@linux.ibm.com> wrote: > >>>>> > >>>>> > >>>>>> + while (outmost_loop != loop) > >>>>>> + { > >>>>>> + if (bb_colder_than_loop_preheader (loop_preheader_edge > >>>>>> (outmost_loop)->src, > >>>>>> + loop_preheader_edge > >>>>>> (cold_loop)->src)) > >>>>>> + cold_loop = outmost_loop; > >>>>>> + outmost_loop = superloop_at_depth (loop, loop_depth > >>>>>> (outmost_loop) + 1); > >>>>>> + } > >>>>>> > >>>>>> could be instead written as > >>>>>> > >>>>>> coldest_loop = coldest_outermost_loop[loop->num]; > >>>>>> if (loop_depth (coldest_loop) < loop_depth (outermost_loop)) > >>>>>> return outermost_loop; > >>>>>> return coldest_loop; > >>>>>> > >>>>>> ? And in the usual case coldest_outermost_loop[L] would be the loop > >>>>>> tree root. > >>>>>> It should be possible to compute such cache in a DFS walk of the loop > >>>>>> tree > >>>>>> (the loop iterator by default visits in such order). > >>>>> > >>>>> > >>>>> Thanks. Updated the patch with your suggestion. Not sure whether it > >>>>> strictly > >>>>> conforms to your comments. Though the patch passed all my added > >>>>> tests(coverage not enough), > >>>>> I am still a bit worried if pre-computed coldest_loop is outside of > >>>>> outermost_loop, but > >>>>> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner) > >>>>> > >>>>> [loop tree root, coldest_loop, outermost_loop,..., > >>>>> second_coldest_loop, ..., loop], > >>>>> > >>>>> then function find_coldest_out_loop will return a loop NOT accord with > >>>>> our > >>>>> expectation, that should return second_coldest_loop instead of > >>>>> outermost_loop? > >>>> Hmm, interesting - yes. I guess the common case will be that the > >>>> pre-computed > >>>> outermost loop will be the loop at depth 1 since outer loops tend to > >>>> be colder than > >>>> inner loops? That would then defeat the whole exercise. > >>> > >>> It is not easy to construct such cases, But finally I got below results, > >>> > >>> 1) many cases inner loop is hotter than outer loop, for example: > >>> > >>> loop 1's coldest_outermost_loop is 1, colder_than_inner_loop is NULL > >>> loop 2's coldest_outermost_loop is 1, colder_than_inner_loop is 1 > >>> loop 3's coldest_outermost_loop is 1, colder_than_inner_loop is 2 > >>> loop 4's coldest_outermost_loop is 1, colder_than_inner_loop is 2 > >>> > >>> > >>> 2) But there are also cases inner loop is colder than outer loop, like: > >>> > >>> loop 1's coldest outermost loop is 1, colder_than_inner_loop is NULL > >>> loop 2's coldest outermost loop is 2, colder_than_inner_loop is NULL > >>> loop 3's coldest outermost loop is 3, colder_than_inner_loop is NULL > >>> > >>> > >>>> > >>>> To optimize the common case but not avoiding iteration in the cases we > >>>> care > >>>> about we could instead cache the next outermost loop that is _not_ colder > >>>> than loop. So for your [ ... ] example above we'd have> > >>>> hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the > >>>> candidate would then be 'second_coldest_loop' and we'd then iterate > >>>> to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next > >>>> cold candidate we can compare against? For the common case we'd > >>>> have hotter_than_inner_loop[looo] == NULL (no such loop) and we then > >>>> simply pick 'outermost_loop'. > >>> > >>> Thanks. It was difficult to understand, but finally I got to know what > >>> you > >>> want to express :) > >>> > >>> We should cache the next loop that is *colder* than loop instead of > >>> '_not_ colder > >>> than loop', and 'hotter_than_inner_loop' should be > >>> 'colder_than_inner_loop', > >>> then it makes sense if the coldest loop is outside of outermost loop, > >>> continue to > >>> find a colder loop between outermost loop and current loop in > >>> colder_than_inner_loop[loop->num]? Hope I understood you correctly... > >> > >> Heh, looking at the patch - I don't know. > >> > >> To make the calls to bb_colder_than_loop_preheader more obvious can you > >> change that function to > >> > >> +bool bb_colder_than_loop_preheader (basic_block bb, loop *loop) > >> +{ > >> + return bb->count < loop_preheader_edge (loop)->src->count; > >> > >> ? > >> > >> + if (colder_than_inner_loop[loop->num] != NULL > >> + && loop_depth (outermost_loop) > >> + < loop_depth (colder_than_inner_loop[loop->num])) > >> + return colder_than_inner_loop[loop->num]; > >> > >> that is > >> > >> + 3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is > >> a > >> + colder loop between OUTERMOST_LOOP and loop in pre-computed > >> + COLDER_THAN_INNER_LOOP, return it, otherwise return OUTERMOST_LOOP. */ > >> > >> since we index colder_than_inner_loop by loop->num as well this doesn't > >> find the > >> coldest outer loop of 'loop' within the nest up to outermost_loop, > >> correct? For the > >> "common" case of each successive outer loop being colder than the immediate > >> inner loop doesn't that result in only hoisting one level in case the > >> coldest outer loop > >> is outside of the outermost loop we can legally hoist to? > >> > >> My suggestion to cache the next outer "hotter" loop was to make the search > >> for > >> the optimal hoisting position O(1) for the common case (there is no > >> hotter outer loop), > >> allowing to do > >> > >> hot_loop = next_hot_loop[loop->num]; > >> if (!hot_loop || loop_depth (hot_loop) < loop_depth (outermost_loop)) > >> return outermost_loop; > >> > >> and do the linear search in the other uncommon case (eventually the cache > >> can be exploited to optimize that search as well). > >> > >> The cache should guarantee that for loops between next_hot_loop[loop->num] > >> and loop each loop is colder than its inner loop, so the immediate inner > >> loop > >> of 'hot_loop' is the coldest place in the sub-nest. > >> > >> Basically the cache should make the common case O(1) (well, I _do_ hope > >> that is the common case ;)) > >> > >> Btw, can you make the cache array a vec<class loop> * please so we get > >> index checking? > >> > > > > Thanks, updated with your comments to use hotter_than_inner_loop instead of > > colder_than_inner_loop, it works. > > > > For typical case like below: > > > > for (int j = 0; j < m; j++) // Loop 1 > > if (__builtin_expect (y, 0)) > > for (int j2 = 0; j2 < 256; j2++) // Loop 2 > > for (int i = 0; i < m; i++) // Loop 3 > > for (int s = 0; s < m; s++) // Loop 4 > > for (int s1 = 0; s1 < m; s1++) // Loop 5 > > if (__builtin_expect (y, 0)) > > for (int s2 = 0; s2 < m; s2++) // Loop 6 > > > > > > The output is: > > > > loop 1's coldest_outermost_loop is 1, hotter_than_inner_loop is NULL > > loop 2's coldest_outermost_loop is 2, hotter_than_inner_loop is 1 > > loop 3's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL > > loop 4's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL > > loop 5's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL > > loop 6's coldest_outermost_loop is 2, hotter_than_inner_loop is 5 > > > > > > > > v8-0002-Don-t-move-cold-code-out-of-loop-by-checking-bb-c.patch > > > > > > From: Xiong Hu Luo <luo...@linux.ibm.com> > > > > v8 changes: > > 1. Use hotter_than_inner_loop instead of colder to store a hotter loop > > nearest to loop. > > 2. Update the logic in fill_coldest_and_hotter_out_loop and > > get_coldest_out_loop to make common case O(1). > > 3. Update function argument bb_colder_than_loop_preheader. > > 4. Make cached array to vec<class *loop> for index checking. > > > > v7 changes: > > 1. Refine get_coldest_out_loop to replace loop with checking > > pre-computed coldest_outermost_loop and colder_than_inner_loop. > > 2. Add function fill_cold_out_loop, compute coldest_outermost_loop and > > colder_than_inner_loop recursively without loop. > > > > v6 changes: > > 1. Add function fill_coldest_out_loop to pre compute the coldest > > outermost loop for each loop. > > 2. Rename find_coldest_out_loop to get_coldest_out_loop. > > 3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c. > > > > v5 changes: > > 1. Refine comments for new functions. > > 2. Use basic_block instead of count in bb_colder_than_loop_preheader > > to align with function name. > > 3. Refine with simpler implementation for get_coldest_out_loop and > > ref_in_loop_hot_body::operator for better understanding. > > > > v4 changes: > > 1. Sort out profile_count comparision to function > > bb_cold_than_loop_preheader. > > 2. Update ref_in_loop_hot_body::operator () to find cold_loop before > > compare. > > 3. Split RTL invariant motion part out. > > 4. Remove aux changes. > > > > v3 changes: > > 1. Handle max_loop in determine_max_movement instead of > > outermost_invariant_loop. > > 2. Remove unnecessary changes. > > 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in > > can_sm_ref_p. > > 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused > > infinite loop when implementing v1 and the iteration is missed to be > > updated actually. > > > > v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html > > v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html > > v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html > > v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html > > v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html > > > > There was a patch trying to avoid move cold block out of loop: > > > > https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html > > > > Richard suggested to "never hoist anything from a bb with lower execution > > frequency to a bb with higher one in LIM invariantness_dom_walker > > before_dom_children". > > > > In gimple LIM analysis, add get_coldest_out_loop to move invariants to > > expected target loop, if profile count of the loop bb is colder > > than target loop preheader, it won't be hoisted out of loop. > > Likely for store motion, if all locations of the REF in loop is cold, > > don't do store motion of it. > > > > SPEC2017 performance evaluation shows 1% performance improvement for > > intrate GEOMEAN and no obvious regression for others. Especially, > > 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is > > largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% > > on P8LE. > > > > gcc/ChangeLog: > > > > * tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New > > function. > > (get_coldest_out_loop): New function. > > (determine_max_movement): Use get_coldest_out_loop. > > (move_computations_worker): Adjust and fix iteration udpate. > > (class ref_in_loop_hot_body): New functor. > > (ref_in_loop_hot_body::operator): New. > > (can_sm_ref_p): Use for_all_locs_in_loop. > > (fill_coldest_and_hotter_out_loop): New. > > (tree_ssa_lim_finalize): Free coldest_outermost_loop and > > hotter_than_inner_loop. > > (loop_invariant_motion_in_fun): Call fill_coldet_and_hotter_out_loop. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.dg/tree-ssa/recip-3.c: Adjust. > > * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > > * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > > * gcc.dg/tree-ssa/ssa-lim-21.c: New test. > > * gcc.dg/tree-ssa/ssa-lim-22.c: New test. > > * gcc.dg/tree-ssa/ssa-lim-23.c: New test. > > --- > > gcc/tree-ssa-loop-im.c | 150 ++++++++++++++++++++- > > gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- > > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 29 ++++ > > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 ++++ > > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 35 +++++ > > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c | 32 +++++ > > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c | 21 +++ > > 7 files changed, 291 insertions(+), 3 deletions(-) > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c > > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c > > > > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > > index 4b187c2cdaf..0bcc3bca91b 100644 > > --- a/gcc/tree-ssa-loop-im.c > > +++ b/gcc/tree-ssa-loop-im.c > > @@ -146,6 +146,11 @@ public: > > enum dep_kind { lim_raw, sm_war, sm_waw }; > > enum dep_state { dep_unknown, dep_independent, dep_dependent }; > > > > +/* coldest outermost loop for given loop. */ > > +vec<class loop *> coldest_outermost_loop; > > +/* hotter outer loop nearest to given loop. */ > > +vec<class loop *> hotter_than_inner_loop; > > + > > /* Populate the loop dependence cache of REF for LOOP, KIND with STATE. */ > > > > static void > > @@ -417,6 +422,61 @@ movement_possibility (gimple *stmt) > > return ret; > > } > > > > +/* Compare the profile count inequality of bb and loop's preheader, it is > > + three-state as stated in profile-count.h, FALSE is returned if > > inequality > > + cannot be decided. */ > > +bool > > +bb_colder_than_loop_preheader (basic_block bb, class loop *loop) > > +{ > > + gcc_assert (bb && loop); > > + return bb->count < loop_preheader_edge (loop)->src->count; > > +} > > + > > +/* Check coldest loop between OUTERMOST_LOOP and LOOP by comparing profile > > + count. > > + It does three steps check: > > + 1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, > > just > > + return NULL which means it should not be moved out at all; > > + 2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of > > + OUTERMOST_LOOP, if it is inside of OUTERMOST_LOOP, return the > > COLDEST_LOOP; > > + 3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a > > + hotter loop between OUTERMOST_LOOP and loop in pre-computed > > + HOTTER_THAN_INNER_LOOP, return it's nested inner loop, otherwise return > > + OUTERMOST_LOOP. */ > > + > > +static class loop * > > +get_coldest_out_loop (class loop *outermost_loop, class loop *loop, > > + basic_block curr_bb) > > +{ > > + gcc_assert (outermost_loop == loop > > + || flow_loop_nested_p (outermost_loop, loop)); > > + > > + /* If bb_colder_than_loop_preheader returns false due to three-state > > + comparision, OUTERMOST_LOOP is returned finally to preserve the > > behavior. > > + Otherwise, return the coldest loop between OUTERMOST_LOOP and LOOP. */ > > + if (curr_bb && bb_colder_than_loop_preheader (curr_bb, loop)) > > + return NULL; > > + > > + class loop *coldest_loop = coldest_outermost_loop[loop->num]; > > + if (loop_depth (coldest_loop) < loop_depth (outermost_loop)) > > + { > > + class loop *hotter_loop = hotter_than_inner_loop[loop->num]; > > + if (!hotter_loop > > + || loop_depth (hotter_loop) < loop_depth (outermost_loop)) > > + return outermost_loop; > > + > > + /* hotter_loop is between OUTERMOST_LOOP and LOOP like: > > + [loop tree root, ..., coldest_loop, ..., outermost_loop, ..., > > + hotter_loop, second_coldest_loop, ..., loop] > > + return second_coldest_loop to be the hoist target. */ > > + class loop *aloop; > > + for (aloop = hotter_loop->inner; aloop; aloop = aloop->next) > > + if (flow_loop_nested_p (aloop, loop)) > > should be: > > if (aloop == loop || flow_loop_nested_p (aloop, loop))
OK with that fixed. Are necessary prerequesites committed to avoid regressions? I guess we need to keep a watchful eye and eventually revert (or gate with a --param disabled by default) the new behavior if severe regressions are discovered. Thanks and sorry for the repeated delays. Richard. > > > + return aloop; > > + } > > + return coldest_loop; > > +} > > + > > > -- > Thanks, > Xionghu