On Tue, 17 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/17 15:12, Richard Biener wrote: > > On Tue, 17 Aug 2021, Xionghu Luo wrote: > > > >> Hi, > >> > >> On 2021/8/16 19:46, Richard Biener wrote: > >>> On Mon, 16 Aug 2021, Xiong Hu Luo wrote: > >>> > >>>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for > >>>> nested loops. inn_loop is updated to inner loop, so it need be restored > >>>> when exiting from innermost loop. With this patch, the store instruction > >>>> in outer loop could also be moved out of outer loop by store motion. > >>>> Any comments? Thanks. > >>> > >>>> gcc/ChangeLog: > >>>> > >>>> * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore > >>>> inn_loop when exiting from innermost loop. > >>>> > >>>> gcc/testsuite/ChangeLog: > >>>> > >>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > >>>> --- > >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ > >>>> gcc/tree-ssa-loop-im.c | 6 +++++- > >>>> 2 files changed, 29 insertions(+), 1 deletion(-) > >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> > >>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> new file mode 100644 > >>>> index 00000000000..097a5ee4a4b > >>>> --- /dev/null > >>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> @@ -0,0 +1,24 @@ > >>>> +/* PR/101293 */ > >>>> +/* { dg-do compile } */ > >>>> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > >>>> + > >>>> +struct X { int i; int j; int k;}; > >>>> + > >>>> +void foo(struct X *x, int n, int l) > >>>> +{ > >>>> + for (int j = 0; j < l; j++) > >>>> + { > >>>> + for (int i = 0; i < n; ++i) > >>>> + { > >>>> + int *p = &x->j; > >>>> + int tem = *p; > >>>> + x->j += tem * i; > >>>> + } > >>>> + int *r = &x->k; > >>>> + int tem2 = *r; > >>>> + x->k += tem2 * j; > >>>> + } > >>>> +} > >>>> + > >>>> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" > >>>> } } > >>>> */ > >>>> + > >>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >>>> index b24bc64f2a7..5ca4738b20e 100644 > >>>> --- a/gcc/tree-ssa-loop-im.c > >>>> +++ b/gcc/tree-ssa-loop-im.c > >>>> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, > >>>> sbitmap > >>>> @@ contains_call) > >>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>> last = bb; > >>>> + if (inn_loop != loop > >>>> + && flow_loop_nested_p (bb->loop_father, inn_loop)) > >>>> + inn_loop = bb->loop_father; > >>>> + > >>> > >>> The comment says > >>> > >>> /* In a loop that is always entered we may proceed anyway. > >>> But record that we entered it and stop once we leave > >>> it. > >>> */ > >>> inn_loop = bb->loop_father; > >>> > >>> and your change would defeat that early return, no? > >> > >> The issue is the search method exits too early when iterating the outer > >> loop. For example of a nested loop, loop 1 includes 5,8,3,10,4,9 > >> and loop2 includes 3,10. Currently, it breaks when bb is 3 as bb 3 > >> doesn't dominate bb 9 of loop 1. But actually, both bb 5 and bb 4 are > >> ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4 > >> they won't be processed by store motion again. > >> > >> > >> 5<---- > >> |\ | > >> 8 \ 9 > >> | \ | > >> --->3--->4 > >> | | > >> 10---| > >> > >> > >> SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this > >> patch, it will continue search when meet bb 3 until bb 4, then last is > >> updated > >> to bb 4, it will break until exit edge is found at bb 4 by > >> "if (!flow_bb_inside_loop_p (loop, e->dest))". Then the followed loop code > >> will > >> set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5. > >> > >> > >> while (1) > >> { > >> SET_ALWAYS_EXECUTED_IN (last, loop); > >> if (last == loop->header) > >> break; > >> last = get_immediate_dominator (CDI_DOMINATORS, last); > >> } > >> > >> After further discussion with Kewen, we found that the inn_loop variable is > >> totally useless and could be removed. > >> > >> > >>> > >>>> if (bitmap_bit_p (contains_call, bb->index)) > >>>> break; > >>>> > >>>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, > >>>> sbitmap > >>>> @@ contains_call) > >>>> > >>>> if (bb->loop_father->header == bb) > >>>> { > >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>> + if (!dominated_by_p (CDI_DOMINATORS, > >>>> bb->loop_father->latch, > >>>> bb)) > >>>> break; > >>> > >>> That's now a always false condition - a loops latch is always dominated > >>> by its header. The condition as written tries to verify whether the > >>> loop is always entered - mind we visit all blocks, not only those > >>> always executed. > >> > >> Thanks for the catch! I am afraid the piece of code should be removed > >> since > >> it stops > >> search of potential ALWAYS EXECUTED bb after inner loop... > > > > But the code says: > > > > /* In a loop that is always entered we may proceed anyway. > > But record that we entered it and stop once we leave it. > > */ > > > > and you do not remove this comment still it doesn't hold anymore > > after your patch. I don't say the current code is optimal - I just > > say it does exactly what is documented. > > Removed. > > > > > > >>> > >>> In fact for your testcase the x->j ref is _not_ always executed > >>> since the inner loop is conditional on n > 0. > >> > >> Yes. But I want to move x->k (not x->j) out of loop 1 when l > 0 in > >> store-motion. > > > > Sorry, that wasn't clear. Yes, I agree the code fails to see always > > executed blocks after inner loops. But since the code simply walks > > all blocks in a loop instead of greedily following edges it has > > to do that since the inner loop might exit the outer loop as well, > > sth your change wouldn't honor. Consider > > This is already handled now, if there is an edge exiting out of outer > loop directly, it will also early break, I've verified it with your case. > > > FOR_EACH_EDGE (e, ei, bb->succs) > { > /* If there is an exit from this BB. */ > if (!flow_bb_inside_loop_p (loop, e->dest)) > break; > > > > while (--n) > > { > > do > > { > > if (do_exit) > > goto out; > > } > > while (1); > > p->x += 1; > > } > > out:; > > > > you'll see a CFG where the inner loop exits the outer loop as well. > > > > So I'm saying to improve the code you'll likely have to do more. > > And add a testcase like the following > > > > void __attribute__((noipa)) > > foo (int n, int m, int f, int *p, int *q) > > { > > while (--n) > > { > > do > > { > > *q += 1; > > if (f) > > goto out; > > } > > while (--m); > > *p += 1; > > } > > out:; > > } > > > > int main() > > { > > int i = 0; > > foo (10, 10, 1, (void *)0, &i); > > if (i != 1) > > __builtin_abort (); > > return 0; > > } > > > > Richard. > > > > Updated the patch: > > > [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 > > It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for > nested loops. Current design will exit if an inner loop doesn't > dominate outer loop's latch or exit after exiting from inner loop, which > caused early return from outer loop, then ALWAYS EXECUTED blocks after > inner loops is skipped. > This patch removes the redundant inner loop check, but keep the check of > exiting to out of outer loop from inner loop, then the store instruction > in outer loop could also be moved out of outer loop by store motion. > > gcc/ChangeLog: > > * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove > inn_loop check. > > gcc/testsuite/ChangeLog: > > * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > --- > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 +++++++++++++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 ++++++++++++++++++++++ > gcc/tree-ssa-loop-im.c | 14 ---------- > 3 files changed, 54 insertions(+), 14 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > new file mode 100644 > index 00000000000..097a5ee4a4b > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > @@ -0,0 +1,24 @@ > +/* PR/101293 */ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +struct X { int i; int j; int k;}; > + > +void foo(struct X *x, int n, int l) > +{ > + for (int j = 0; j < l; j++) > + { > + for (int i = 0; i < n; ++i) > + { > + int *p = &x->j; > + int tem = *p; > + x->j += tem * i; > + } > + int *r = &x->k; > + int tem2 = *r; > + x->k += tem2 * j; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ > + > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > new file mode 100644 > index 00000000000..6e180de528e > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > @@ -0,0 +1,30 @@ > +/* PR/101293 */ > +/* { dg-do run } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) > +{ > + while (--n) > + { > + do > + { > + *q += 1; > + if (f) > + goto out; > + } > + while (--m); > + *p += 1; > + } > +out:; > +} > + > +int main() > +{ > + int i = 0; > + foo (10, 10, 1, (void *) 0, &i); > + if (i != 1) > + __builtin_abort (); > + return 0; > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > index b24bc64f2a7..c44f4390ce2 100644 > --- a/gcc/tree-ssa-loop-im.c > +++ b/gcc/tree-ssa-loop-im.c > @@ -3197,7 +3197,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap > contains_call) > basic_block bb = NULL, *bbs, last = NULL; > unsigned i; > edge e; > - class loop *inn_loop = loop; > > if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > { > @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap > contains_call) > to disprove this if possible). */ > if (bb->flags & BB_IRREDUCIBLE_LOOP) > break; > - > - if (!flow_bb_inside_loop_p (inn_loop, bb)) > - break; > - > - if (bb->loop_father->header == bb) > - { > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > - break; > - > - /* In a loop that is always entered we may proceed anyway. > - But record that we entered it and stop once we leave it. */ > - inn_loop = bb->loop_father; > - } > } > > while (1)
I'm not sure this will work correct (I'm not sure how the existing code makes it so either...). That said, I can't poke any hole into the change. What I see is that definitely if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) last = bb; if (bitmap_bit_p (contains_call, bb->index)) break; doesn't work reliably since the DOM ordering will process blocks A B and C in random order for for (;;) { if (cond) { A: foo (); } else B:; C:; } and thus we can end up setting 'last' to C _before_ processing 'A' and thus arriving at the call foo () ... get_loop_body_in_dom_order does some "special sauce" but not to address the above problem - but it might be that a subtle issue like the above is the reason for the inner loop handling. The inner loop block order does _not_ adhere to this "special sauce", that is - the "Additionally, if a basic block s dominates the latch, then only blocks dominated by s are be after it." guarantee holds for the outer loop latch, not for the inner. Digging into the history of fill_always_executed_in_1 doesn't reveal anything - the inner loop handling has been present since introduction by Zdenek - but usually Zdenek has a reason for doing things as he does ;) Note it might be simply a measure against quadratic complexity, esp. since with your patch we also dive into not always executed subloops as you remove the if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) break; check. I suggest to evaluate behavior of the patch on a testcase like void foo (int n, int **k) { for (int i = 0; i < n; ++i) if (k[0][i]) for (int j = 0; j < n; ++j) if (k[1][j]) for (int l = 0; l < n; ++l) if (k[2][l]) ... } I suspect you'll see quadratic behavior with your patch. You should be at least able to preserve a check like /* Do not process not always executed subloops to avoid quadratic behavior. */ if (bb->loop_father->header == bb && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) break; which is of course not optimistic for cases like for (..) { if (cond) for (..) x = 1; // this is always executed if the inner loop is finite } but we need to have an eye on the complexity of this function. I would have suggested to do greedy visiting of the loop header successors, processing further blocks if all entries (ignoring backedges) are processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist is empty proceed to inner loops as the current code does. For bookkeeping simply keep a to-visit-incoming-edges counter per BB. Pseudo-code: bitmap_set_bit (worklist, loop-header-bb); while (!bitmap_empty_p (worklist)) { bb = pop (worklist); SET_ALWAYS_EXECUTED_IN (bb, loop); if (bitmap_bit_p (contains_call, bb->index)) continue; FOR_EACH_EDGE (e, ei, bb->succs) { if (!flow_bb_inside_loop_p (loop, e->dest)) continue; if (incoming_count[e->dest->index]-- == 0) push (worklist, e->dest); } } iterate over inner loops (incoming_count can be retained, we just force the inner loop header onto the worklist). that would avoid the quadraticness with respect to get_loop_body_in_dom_order and also fix the dominator issue mentioned above. Richard.