On 2021/12/6 13:09, Xionghu Luo via Gcc-patches wrote:
>
>
> On 2021/12/1 18:09, Richard Biener wrote:
>> On Wed, Nov 10, 2021 at 4:08 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
>>>
>>>
>>>
>>> On 2021/11/4 21:00, Richard Biener wrote:
>>>> On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luo...@linux.ibm.com> wrote:
>>>>>
>>>>>
>>>>>> + while (outmost_loop != loop)
>>>>>> + {
>>>>>> + if (bb_colder_than_loop_preheader (loop_preheader_edge
>>>>>> (outmost_loop)->src,
>>>>>> + loop_preheader_edge
>>>>>> (cold_loop)->src))
>>>>>> + cold_loop = outmost_loop;
>>>>>> + outmost_loop = superloop_at_depth (loop, loop_depth
>>>>>> (outmost_loop) + 1);
>>>>>> + }
>>>>>>
>>>>>> could be instead written as
>>>>>>
>>>>>> coldest_loop = coldest_outermost_loop[loop->num];
>>>>>> if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
>>>>>> return outermost_loop;
>>>>>> return coldest_loop;
>>>>>>
>>>>>> ? And in the usual case coldest_outermost_loop[L] would be the loop
>>>>>> tree root.
>>>>>> It should be possible to compute such cache in a DFS walk of the loop
>>>>>> tree
>>>>>> (the loop iterator by default visits in such order).
>>>>>
>>>>>
>>>>> Thanks. Updated the patch with your suggestion. Not sure whether it
>>>>> strictly
>>>>> conforms to your comments. Though the patch passed all my added
>>>>> tests(coverage not enough),
>>>>> I am still a bit worried if pre-computed coldest_loop is outside of
>>>>> outermost_loop, but
>>>>> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
>>>>>
>>>>> [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop,
>>>>> ..., loop],
>>>>>
>>>>> then function find_coldest_out_loop will return a loop NOT accord with our
>>>>> expectation, that should return second_coldest_loop instead of
>>>>> outermost_loop?
>>>> Hmm, interesting - yes. I guess the common case will be that the
>>>> pre-computed
>>>> outermost loop will be the loop at depth 1 since outer loops tend to
>>>> be colder than
>>>> inner loops? That would then defeat the whole exercise.
>>>
>>> It is not easy to construct such cases, But finally I got below results,
>>>
>>> 1) many cases inner loop is hotter than outer loop, for example:
>>>
>>> loop 1's coldest_outermost_loop is 1, colder_than_inner_loop is NULL
>>> loop 2's coldest_outermost_loop is 1, colder_than_inner_loop is 1
>>> loop 3's coldest_outermost_loop is 1, colder_than_inner_loop is 2
>>> loop 4's coldest_outermost_loop is 1, colder_than_inner_loop is 2
>>>
>>>
>>> 2) But there are also cases inner loop is colder than outer loop, like:
>>>
>>> loop 1's coldest outermost loop is 1, colder_than_inner_loop is NULL
>>> loop 2's coldest outermost loop is 2, colder_than_inner_loop is NULL
>>> loop 3's coldest outermost loop is 3, colder_than_inner_loop is NULL
>>>
>>>
>>>>
>>>> To optimize the common case but not avoiding iteration in the cases we care
>>>> about we could instead cache the next outermost loop that is _not_ colder
>>>> than loop. So for your [ ... ] example above we'd have>
>>>> hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
>>>> candidate would then be 'second_coldest_loop' and we'd then iterate
>>>> to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
>>>> cold candidate we can compare against? For the common case we'd
>>>> have hotter_than_inner_loop[looo] == NULL (no such loop) and we then
>>>> simply pick 'outermost_loop'.
>>>
>>> Thanks. It was difficult to understand, but finally I got to know what you
>>> want to express :)
>>>
>>> We should cache the next loop that is *colder* than loop instead of '_not_
>>> colder
>>> than loop', and 'hotter_than_inner_loop' should be 'colder_than_inner_loop',
>>> then it makes sense if the coldest loop is outside of outermost loop,
>>> continue to
>>> find a colder loop between outermost loop and current loop in
>>> colder_than_inner_loop[loop->num]? Hope I understood you correctly...
>>
>> Heh, looking at the patch - I don't know.
>>
>> To make the calls to bb_colder_than_loop_preheader more obvious can you
>> change that function to
>>
>> +bool bb_colder_than_loop_preheader (basic_block bb, loop *loop)
>> +{
>> + return bb->count < loop_preheader_edge (loop)->src->count;
>>
>> ?
>>
>> + if (colder_than_inner_loop[loop->num] != NULL
>> + && loop_depth (outermost_loop)
>> + < loop_depth (colder_than_inner_loop[loop->num]))
>> + return colder_than_inner_loop[loop->num];
>>
>> that is
>>
>> + 3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
>> + colder loop between OUTERMOST_LOOP and loop in pre-computed
>> + COLDER_THAN_INNER_LOOP, return it, otherwise return OUTERMOST_LOOP. */
>>
>> since we index colder_than_inner_loop by loop->num as well this doesn't find
>> the
>> coldest outer loop of 'loop' within the nest up to outermost_loop,
>> correct? For the
>> "common" case of each successive outer loop being colder than the immediate
>> inner loop doesn't that result in only hoisting one level in case the
>> coldest outer loop
>> is outside of the outermost loop we can legally hoist to?
>>
>> My suggestion to cache the next outer "hotter" loop was to make the search
>> for
>> the optimal hoisting position O(1) for the common case (there is no
>> hotter outer loop),
>> allowing to do
>>
>> hot_loop = next_hot_loop[loop->num];
>> if (!hot_loop || loop_depth (hot_loop) < loop_depth (outermost_loop))
>> return outermost_loop;
>>
>> and do the linear search in the other uncommon case (eventually the cache
>> can be exploited to optimize that search as well).
>>
>> The cache should guarantee that for loops between next_hot_loop[loop->num]
>> and loop each loop is colder than its inner loop, so the immediate inner loop
>> of 'hot_loop' is the coldest place in the sub-nest.
>>
>> Basically the cache should make the common case O(1) (well, I _do_ hope
>> that is the common case ;))
>>
>> Btw, can you make the cache array a vec<class loop> * please so we get
>> index checking?
>>
>
> Thanks, updated with your comments to use hotter_than_inner_loop instead of
> colder_than_inner_loop, it works.
>
> For typical case like below:
>
> for (int j = 0; j < m; j++) // Loop 1
> if (__builtin_expect (y, 0))
> for (int j2 = 0; j2 < 256; j2++) // Loop 2
> for (int i = 0; i < m; i++) // Loop 3
> for (int s = 0; s < m; s++) // Loop 4
> for (int s1 = 0; s1 < m; s1++) // Loop 5
> if (__builtin_expect (y, 0))
> for (int s2 = 0; s2 < m; s2++) // Loop 6
>
>
> The output is:
>
> loop 1's coldest_outermost_loop is 1, hotter_than_inner_loop is NULL
> loop 2's coldest_outermost_loop is 2, hotter_than_inner_loop is 1
> loop 3's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> loop 4's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> loop 5's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> loop 6's coldest_outermost_loop is 2, hotter_than_inner_loop is 5
>
>
>
> v8-0002-Don-t-move-cold-code-out-of-loop-by-checking-bb-c.patch
>
>
> From: Xiong Hu Luo <luo...@linux.ibm.com>
>
> v8 changes:
> 1. Use hotter_than_inner_loop instead of colder to store a hotter loop
> nearest to loop.
> 2. Update the logic in fill_coldest_and_hotter_out_loop and
> get_coldest_out_loop to make common case O(1).
> 3. Update function argument bb_colder_than_loop_preheader.
> 4. Make cached array to vec<class *loop> for index checking.
>
> v7 changes:
> 1. Refine get_coldest_out_loop to replace loop with checking
> pre-computed coldest_outermost_loop and colder_than_inner_loop.
> 2. Add function fill_cold_out_loop, compute coldest_outermost_loop and
> colder_than_inner_loop recursively without loop.
>
> v6 changes:
> 1. Add function fill_coldest_out_loop to pre compute the coldest
> outermost loop for each loop.
> 2. Rename find_coldest_out_loop to get_coldest_out_loop.
> 3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.
>
> v5 changes:
> 1. Refine comments for new functions.
> 2. Use basic_block instead of count in bb_colder_than_loop_preheader
> to align with function name.
> 3. Refine with simpler implementation for get_coldest_out_loop and
> ref_in_loop_hot_body::operator for better understanding.
>
> v4 changes:
> 1. Sort out profile_count comparision to function bb_cold_than_loop_preheader.
> 2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
> 3. Split RTL invariant motion part out.
> 4. Remove aux changes.
>
> v3 changes:
> 1. Handle max_loop in determine_max_movement instead of
> outermost_invariant_loop.
> 2. Remove unnecessary changes.
> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> infinite loop when implementing v1 and the iteration is missed to be
> updated actually.
>
> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
> v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
> v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
>
> There was a patch trying to avoid move cold block out of loop:
>
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
>
> In gimple LIM analysis, add get_coldest_out_loop to move invariants to
> expected target loop, if profile count of the loop bb is colder
> than target loop preheader, it won't be hoisted out of loop.
> Likely for store motion, if all locations of the REF in loop is cold,
> don't do store motion of it.
>
> SPEC2017 performance evaluation shows 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others. Especially,
> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> on P8LE.
>
> gcc/ChangeLog:
>
> * tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
> function.
> (get_coldest_out_loop): New function.
> (determine_max_movement): Use get_coldest_out_loop.
> (move_computations_worker): Adjust and fix iteration udpate.
> (class ref_in_loop_hot_body): New functor.
> (ref_in_loop_hot_body::operator): New.
> (can_sm_ref_p): Use for_all_locs_in_loop.
> (fill_coldest_and_hotter_out_loop): New.
> (tree_ssa_lim_finalize): Free coldest_outermost_loop and
> hotter_than_inner_loop.
> (loop_invariant_motion_in_fun): Call fill_coldet_and_hotter_out_loop.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/tree-ssa/recip-3.c: Adjust.
> * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> * gcc.dg/tree-ssa/ssa-lim-21.c: New test.
> * gcc.dg/tree-ssa/ssa-lim-22.c: New test.
> * gcc.dg/tree-ssa/ssa-lim-23.c: New test.
> ---
> gcc/tree-ssa-loop-im.c | 150 ++++++++++++++++++++-
> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +-
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 29 ++++
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 ++++
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 35 +++++
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c | 32 +++++
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c | 21 +++
> 7 files changed, 291 insertions(+), 3 deletions(-)
> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c
>
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 4b187c2cdaf..0bcc3bca91b 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -146,6 +146,11 @@ public:
> enum dep_kind { lim_raw, sm_war, sm_waw };
> enum dep_state { dep_unknown, dep_independent, dep_dependent };
>
> +/* coldest outermost loop for given loop. */
> +vec<class loop *> coldest_outermost_loop;
> +/* hotter outer loop nearest to given loop. */
> +vec<class loop *> hotter_than_inner_loop;
> +
> /* Populate the loop dependence cache of REF for LOOP, KIND with STATE. */
>
> static void
> @@ -417,6 +422,61 @@ movement_possibility (gimple *stmt)
> return ret;
> }
>
> +/* Compare the profile count inequality of bb and loop's preheader, it is
> + three-state as stated in profile-count.h, FALSE is returned if inequality
> + cannot be decided. */
> +bool
> +bb_colder_than_loop_preheader (basic_block bb, class loop *loop)
> +{
> + gcc_assert (bb && loop);
> + return bb->count < loop_preheader_edge (loop)->src->count;
> +}
> +
> +/* Check coldest loop between OUTERMOST_LOOP and LOOP by comparing profile
> + count.
> + It does three steps check:
> + 1) Check whether CURR_BB is cold in it's own loop_father, if it is cold,
> just
> + return NULL which means it should not be moved out at all;
> + 2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of
> + OUTERMOST_LOOP, if it is inside of OUTERMOST_LOOP, return the COLDEST_LOOP;
> + 3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
> + hotter loop between OUTERMOST_LOOP and loop in pre-computed
> + HOTTER_THAN_INNER_LOOP, return it's nested inner loop, otherwise return
> + OUTERMOST_LOOP. */
> +
> +static class loop *
> +get_coldest_out_loop (class loop *outermost_loop, class loop *loop,
> + basic_block curr_bb)
> +{
> + gcc_assert (outermost_loop == loop
> + || flow_loop_nested_p (outermost_loop, loop));
> +
> + /* If bb_colder_than_loop_preheader returns false due to three-state
> + comparision, OUTERMOST_LOOP is returned finally to preserve the behavior.
> + Otherwise, return the coldest loop between OUTERMOST_LOOP and LOOP. */
> + if (curr_bb && bb_colder_than_loop_preheader (curr_bb, loop))
> + return NULL;
> +
> + class loop *coldest_loop = coldest_outermost_loop[loop->num];
> + if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> + {
> + class loop *hotter_loop = hotter_than_inner_loop[loop->num];
> + if (!hotter_loop
> + || loop_depth (hotter_loop) < loop_depth (outermost_loop))
> + return outermost_loop;
> +
> + /* hotter_loop is between OUTERMOST_LOOP and LOOP like:
> + [loop tree root, ..., coldest_loop, ..., outermost_loop, ...,
> + hotter_loop, second_coldest_loop, ..., loop]
> + return second_coldest_loop to be the hoist target. */
> + class loop *aloop;
> + for (aloop = hotter_loop->inner; aloop; aloop = aloop->next)
> + if (flow_loop_nested_p (aloop, loop))
should be:
if (aloop == loop || flow_loop_nested_p (aloop, loop))
> + return aloop;
> + }
> + return coldest_loop;
> +}
> +
--
Thanks,
Xionghu