On Tue, 24 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/19 20:11, Richard Biener wrote: > >> - class loop *inn_loop = loop; > >> > >> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > >> { > >> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, > >> sbitmap contains_call) > >> to disprove this if possible). */ > >> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >> break; > >> - > >> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >> - break; > >> - > >> - if (bb->loop_father->header == bb) > >> - { > >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> - break; > >> - > >> - /* In a loop that is always entered we may proceed anyway. > >> - But record that we entered it and stop once we leave it. */ > >> - inn_loop = bb->loop_father; > >> - } > >> } > >> > >> while (1) > > I'm not sure this will work correct (I'm not sure how the existing > > code makes it so either...). That said, I can't poke any hole > > into the change. What I see is that definitely > > > > if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > > last = bb; > > > > if (bitmap_bit_p (contains_call, bb->index)) > > break; > > > > doesn't work reliably since the DOM ordering will process blocks > > A B and C in random order for > > > > for (;;) > > { > > if (cond) > > { > > A: foo (); > > } > > else B:; > > C:; > > } > > > > and thus we can end up setting 'last' to C_before_ processing > > 'A' and thus arriving at the call foo () ... > > > > get_loop_body_in_dom_order does some "special sauce" but not > > to address the above problem - but it might be that a subtle > > issue like the above is the reason for the inner loop handling. > > The inner loop block order does_not_ adhere to this "special sauce", > > that is - the "Additionally, if a basic block s dominates > > the latch, then only blocks dominated by s are be after it." > > guarantee holds for the outer loop latch, not for the inner. > > > > Digging into the history of fill_always_executed_in_1 doesn't > > reveal anything - the inner loop handling has been present > > since introduction by Zdenek - but usually Zdenek has a reason > > for doing things as he does;) > > Yes, this is really complicated usage, thanks for point it out. :) > I constructed two cases to verify this with inner loop includes "If A; else > B; C". > Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks > whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch, > C is postponed, the access order is ABC, 'last' won’t be set to C if A or B > contains call;
But it depends on the order of visiting ABC and that's hard to put into a testcase since it depends on the order of edges and the processing of the dominance computation. ABC are simply unordered with respect to a dominator walk. > Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop, > the access order is CAB, but 'last' also won’t be updated to C in > fill_always_executed_in_1 > since there is also dominate check, then if A or B contains call, it could > break > successfully. > > C won't be set to ALWAYS EXECUTED for both circumstance. > > > > > Note it might be simply a measure against quadratic complexity, > > esp. since with your patch we also dive into not always executed > > subloops as you remove the > > > > if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > > break; > > > > check. I suggest to evaluate behavior of the patch on a testcase > > like > > > > void foo (int n, int **k) > > { > > for (int i = 0; i < n; ++i) > > if (k[0][i]) > > for (int j = 0; j < n; ++j) > > if (k[1][j]) > > for (int l = 0; l < n; ++l) > > if (k[2][l]) > > ... > > } > > Theoretically the complexity is changing from L1(bbs) to > L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), > so fill_always_executed_in_1's execution time is supposed to be increase from > O(n) to O(n2)? The time should depend on loop depth and bb counts. I also > drafted a > test case has 73-depth loop function with 25 no-ipa function copies each > compiled > in lim2 and lim4 dependently. Total execution time of > fill_always_executed_in_1 is > increased from 32ms to 58ms, almost doubled but not quadratic? It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still O(n^2). > It seems reasonable to see compiling time getting longer since most bbs are > checked > more but a MUST to ensure early break correctly in every loop level... > Though loop nodes could be huge, loop depth will never be so large in actual > code? The "in practice" argument is almost always defeated by automatic program generators ;) > > > > I suspect you'll see quadratic behavior with your patch. You > > should be at least able to preserve a check like > > > > /* Do not process not always executed subloops to avoid > > quadratic behavior. */ > > if (bb->loop_father->header == bb > > && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > > break; > > > > which is of course not optimistic for cases like > > > > for (..) > > { > > if (cond) > > for (..) > > x = 1; // this is always executed if the inner loop is finite > > } > > > > but we need to have an eye on the complexity of this function. I would > > have suggested to do greedy visiting of the loop header successors, > > processing further blocks if all entries (ignoring backedges) are > > processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist > > is empty proceed to inner loops as the current code does. For > > bookkeeping simply keep a to-visit-incoming-edges counter per BB. > > Pseudo-code: > > > > bitmap_set_bit (worklist, loop-header-bb); > > while (!bitmap_empty_p (worklist)) > > { > > bb = pop (worklist); > > Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN? Ah, sure. > > SET_ALWAYS_EXECUTED_IN (bb, loop); > > if (bitmap_bit_p (contains_call, bb->index)) > > continue; > > FOR_EACH_EDGE (e, ei, bb->succs) > > { > > if (!flow_bb_inside_loop_p (loop, e->dest)) > > continue; > > if (incoming_count[e->dest->index]-- == 0) > > push (worklist, e->dest); > > } > > } > > Sorry I don't fully understand your algorithm. worklist should be > auto_vec<basic_block> don't support bitmap operations? Is incoming_count > the bb's preds count, why need retain it since it it decreased to 0? the worklist is a auto_bitmap, pop () would be bitmap_first_set_bit/clear_bit. One could use a vector with the caveat of eventually adding duplicate members to the worklist. But as said, it's pseudo-code ;) > > > > iterate over inner loops (incoming_count can be retained, > > we just force the inner loop header onto the worklist). > > Is this same to ? > > for (loop = loop->inner; loop; loop = loop->next) > fill_always_executed_in_1 (loop, contains_call) Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which should iterate over loops in PRE order. Richard.