Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1

Xionghu Luo via Gcc-patches Wed, 18 Aug 2021 22:24:01 -0700

On 2021/8/17 17:10, Xionghu Luo via Gcc-patches wrote:
> 
> 
> On 2021/8/17 15:12, Richard Biener wrote:
>> On Tue, 17 Aug 2021, Xionghu Luo wrote:
>>
>>> Hi,
>>>
>>> On 2021/8/16 19:46, Richard Biener wrote:
>>>> On Mon, 16 Aug 2021, Xiong Hu Luo wrote:
>>>>
>>>>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for
>>>>> nested loops.  inn_loop is updated to inner loop, so it need be restored
>>>>> when exiting from innermost loop. With this patch, the store instruction
>>>>> in outer loop could also be moved out of outer loop by store motion.
>>>>> Any comments?  Thanks.
>>>>
>>>>> gcc/ChangeLog:
>>>>>
>>>>>    * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore
>>>>>    inn_loop when exiting from innermost loop.
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>>
>>>>>   * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>>>>> ---
>>>>>     gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++
>>>>>     gcc/tree-ssa-loop-im.c                     |  6 +++++-
>>>>>     2 files changed, 29 insertions(+), 1 deletion(-)
>>>>>     create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>>>>
>>>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>>>> new file mode 100644
>>>>> index 00000000000..097a5ee4a4b
>>>>> --- /dev/null
>>>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>>>>> @@ -0,0 +1,24 @@
>>>>> +/* PR/101293 */
>>>>> +/* { dg-do compile } */
>>>>> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
>>>>> +
>>>>> +struct X { int i; int j; int k;};
>>>>> +
>>>>> +void foo(struct X *x, int n, int l)
>>>>> +{
>>>>> +  for (int j = 0; j < l; j++)
>>>>> +    {
>>>>> +      for (int i = 0; i < n; ++i)
>>>>> + {
>>>>> +   int *p = &x->j;
>>>>> +   int tem = *p;
>>>>> +   x->j += tem * i;
>>>>> + }
>>>>> +      int *r = &x->k;
>>>>> +      int tem2 = *r;
>>>>> +      x->k += tem2 * j;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } 
>>>>> }
>>>>> */
>>>>> +
>>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
>>>>> index b24bc64f2a7..5ca4738b20e 100644
>>>>> --- a/gcc/tree-ssa-loop-im.c
>>>>> +++ b/gcc/tree-ssa-loop-im.c
>>>>> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, 
>>>>> sbitmap
>>>>> @@ contains_call)
>>>>>        if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>>>>          last = bb;
>>>>>     +       if (inn_loop != loop
>>>>> +       && flow_loop_nested_p (bb->loop_father, inn_loop))
>>>>> +     inn_loop = bb->loop_father;
>>>>> +
>>>>
>>>> The comment says
>>>>
>>>>                  /* In a loop that is always entered we may proceed anyway.
>>>>                     But record that we entered it and stop once we leave 
>>>> it.
>>>> */
>>>>                  inn_loop = bb->loop_father;
>>>>
>>>> and your change would defeat that early return, no?
>>>
>>> The issue is the search method exits too early when iterating the outer
>>> loop.  For example of a nested loop, loop 1 includes 5,8,3,10,4,9
>>> and loop2 includes 3,10.  Currently, it breaks when bb is 3 as bb 3
>>> doesn't dominate bb 9 of loop 1.  But actually, both bb 5 and bb 4 are
>>> ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4
>>> they won't be processed by store motion again.
>>>
>>>
>>>       5<----
>>>       |\   |
>>>       8 \  9
>>>       |  \ |
>>> --->3--->4
>>> |    |
>>> 10---|
>>>
>>>
>>> SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this
>>> patch, it will continue search when meet bb 3 until bb 4, then last is 
>>> updated
>>> to bb 4, it will break until exit edge is found at bb 4 by
>>> "if (!flow_bb_inside_loop_p (loop, e->dest))".  Then the followed loop code
>>> will
>>> set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5.
>>>
>>>
>>>        while (1)
>>>     {
>>>       SET_ALWAYS_EXECUTED_IN (last, loop);
>>>       if (last == loop->header)
>>>         break;
>>>       last = get_immediate_dominator (CDI_DOMINATORS, last);
>>>     }
>>>
>>> After further discussion with Kewen, we found that the inn_loop variable is
>>> totally useless and could be removed.
>>>
>>>
>>>>
>>>>>        if (bitmap_bit_p (contains_call, bb->index))
>>>>>          break;
>>>>>     
>>>>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap
>>>>> @@ contains_call)
>>>>>     
>>>>>        if (bb->loop_father->header == bb)
>>>>>               {
>>>>> -       if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>>>> +       if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch,
>>>>> bb))
>>>>>       break;
>>>>
>>>> That's now a always false condition - a loops latch is always dominated
>>>> by its header.  The condition as written tries to verify whether the
>>>> loop is always entered - mind we visit all blocks, not only those
>>>> always executed.
>>>
>>> Thanks for the catch!  I am afraid the piece of code should be removed since
>>> it stops
>>> search of potential ALWAYS EXECUTED bb after inner loop...
>>
>> But the code says:
>>
>>               /* In a loop that is always entered we may proceed anyway.
>>                  But record that we entered it and stop once we leave it.
>>                */
>>
>> and you do not remove this comment still it doesn't hold anymore
>> after your patch.  I don't say the current code is optimal - I just
>> say it does exactly what is documented.
> 
> Removed.
> 
>>
>>
>>>>
>>>> In fact for your testcase the x->j ref is _not_ always executed
>>>> since the inner loop is conditional on n > 0.
>>>
>>> Yes.  But I want to move x->k (not x->j) out of loop 1 when l > 0 in
>>> store-motion.
>>
>> Sorry, that wasn't clear.  Yes, I agree the code fails to see always
>> executed blocks after inner loops.  But since the code simply walks
>> all blocks in a loop instead of greedily following edges it has
>> to do that since the inner loop might exit the outer loop as well,
>> sth your change wouldn't honor.  Consider
> 
> This is already handled now, if there is an edge exiting out of outer
> loop directly, it will also early break, I've verified it with your case.
> 
> 
>         FOR_EACH_EDGE (e, ei, bb->succs)
>           {
>             /* If there is an exit from this BB.  */
>             if (!flow_bb_inside_loop_p (loop, e->dest))
>               break;
>>
>>    while (--n)
>>     {
>>       do
>>         {
>>           if (do_exit)
>>             goto out;
>>         }
>>       while (1);
>>       p->x += 1;
>>     }
>> out:;
>>
>> you'll see a CFG where the inner loop exits the outer loop as well.
>>
>> So I'm saying to improve the code you'll likely have to do more.
>> And add a testcase like the following
>>
>> void __attribute__((noipa))
>> foo (int n, int m, int f, int *p, int *q)
>> {
>>    while (--n)
>>     {
>>       do
>>         {
>>           *q += 1;
>>           if (f)
>>             goto out;
>>         }
>>       while (--m);
>>       *p += 1;
>>     }
>> out:;
>> }
>>
>> int main()
>> {
>>     int i = 0;
>>     foo (10, 10, 1, (void *)0, &i);
>>     if (i != 1)
>>       __builtin_abort ();
>>     return 0;
>> }
>>
>> Richard.
>>
> 
> Updated the patch:
> 
> 
> [PATCH v2] Fix incomplete computation in fill_always_executed_in_1
> 
> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for
> nested loops.  Current design will exit if an inner loop doesn't
> dominate outer loop's latch or exit after exiting from inner loop, which
> caused early return from outer loop, then ALWAYS EXECUTED blocks after
> inner loops is skipped.
> This patch removes the redundant inner loop check, but keep the check of
> exiting to out of outer loop from inner loop, then the store instruction
> in outer loop could also be moved out of outer loop by store motion.

This patch is regression tested pass on Power, and shows NO SPEC2017
performance change.

> 
> gcc/ChangeLog:
> 
>       * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove
>       inn_loop check.
> 
> gcc/testsuite/ChangeLog:
> 
>       * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>       * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> ---
>   gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 +++++++++++++++++
>   gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 ++++++++++++++++++++++
>   gcc/tree-ssa-loop-im.c                     | 14 ----------
>   3 files changed, 54 insertions(+), 14 deletions(-)
>   create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>   create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> 
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> new file mode 100644
> index 00000000000..097a5ee4a4b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> @@ -0,0 +1,24 @@
> +/* PR/101293 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +struct X { int i; int j; int k;};
> +
> +void foo(struct X *x, int n, int l)
> +{
> +  for (int j = 0; j < l; j++)
> +    {
> +      for (int i = 0; i < n; ++i)
> +     {
> +       int *p = &x->j;
> +       int tem = *p;
> +       x->j += tem * i;
> +     }
> +      int *r = &x->k;
> +      int tem2 = *r;
> +      x->k += tem2 * j;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> new file mode 100644
> index 00000000000..6e180de528e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> @@ -0,0 +1,30 @@
> +/* PR/101293 */
> +/* { dg-do run } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q)
> +{
> +  while (--n)
> +    {
> +      do
> +     {
> +       *q += 1;
> +       if (f)
> +         goto out;
> +     }
> +      while (--m);
> +      *p += 1;
> +    }
> +out:;
> +}
> +
> +int main()
> +{
> +    int i = 0;
> +    foo (10, 10, 1, (void *) 0, &i);
> +    if (i != 1)
> +      __builtin_abort ();
> +    return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index b24bc64f2a7..c44f4390ce2 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -3197,7 +3197,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap 
> contains_call)
>     basic_block bb = NULL, *bbs, last = NULL;
>     unsigned i;
>     edge e;
> -  class loop *inn_loop = loop;
> 
>     if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
>       {
> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap 
> contains_call)
>            to disprove this if possible).  */
>         if (bb->flags & BB_IRREDUCIBLE_LOOP)
>           break;
> -
> -       if (!flow_bb_inside_loop_p (inn_loop, bb))
> -         break;
> -
> -       if (bb->loop_father->header == bb)
> -         {
> -           if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> -             break;
> -
> -           /* In a loop that is always entered we may proceed anyway.
> -              But record that we entered it and stop once we leave it.  */
> -           inn_loop = bb->loop_father;
> -         }
>       }
> 
>         while (1)
> 

-- 
Thanks,
Xionghu
Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1

Reply via email to