On 2021/8/19 20:11, Richard Biener wrote:
>> - class loop *inn_loop = loop;
>>
>> if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
>> {
>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap
>> contains_call)
>> to disprove this if possible). */
>> if (bb->flags & BB_IRREDUCIBLE_LOOP)
>> break;
>> -
>> - if (!flow_bb_inside_loop_p (inn_loop, bb))
>> - break;
>> -
>> - if (bb->loop_father->header == bb)
>> - {
>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>> - break;
>> -
>> - /* In a loop that is always entered we may proceed anyway.
>> - But record that we entered it and stop once we leave it. */
>> - inn_loop = bb->loop_father;
>> - }
>> }
>>
>> while (1)
> I'm not sure this will work correct (I'm not sure how the existing
> code makes it so either...). That said, I can't poke any hole
> into the change. What I see is that definitely
>
> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> last = bb;
>
> if (bitmap_bit_p (contains_call, bb->index))
> break;
>
> doesn't work reliably since the DOM ordering will process blocks
> A B and C in random order for
>
> for (;;)
> {
> if (cond)
> {
> A: foo ();
> }
> else B:;
> C:;
> }
>
> and thus we can end up setting 'last' to C_before_ processing
> 'A' and thus arriving at the call foo () ...
>
> get_loop_body_in_dom_order does some "special sauce" but not
> to address the above problem - but it might be that a subtle
> issue like the above is the reason for the inner loop handling.
> The inner loop block order does_not_ adhere to this "special sauce",
> that is - the "Additionally, if a basic block s dominates
> the latch, then only blocks dominated by s are be after it."
> guarantee holds for the outer loop latch, not for the inner.
>
> Digging into the history of fill_always_executed_in_1 doesn't
> reveal anything - the inner loop handling has been present
> since introduction by Zdenek - but usually Zdenek has a reason
> for doing things as he does;)
Yes, this is really complicated usage, thanks for point it out. :)
I constructed two cases to verify this with inner loop includes "If A; else B;
C".
Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks
whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch,
C is postponed, the access order is ABC, 'last' won’t be set to C if A or B
contains call;
Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop,
the access order is CAB, but 'last' also won’t be updated to C in
fill_always_executed_in_1
since there is also dominate check, then if A or B contains call, it could break
successfully.
C won't be set to ALWAYS EXECUTED for both circumstance.
>
> Note it might be simply a measure against quadratic complexity,
> esp. since with your patch we also dive into not always executed
> subloops as you remove the
>
> if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> break;
>
> check. I suggest to evaluate behavior of the patch on a testcase
> like
>
> void foo (int n, int **k)
> {
> for (int i = 0; i < n; ++i)
> if (k[0][i])
> for (int j = 0; j < n; ++j)
> if (k[1][j])
> for (int l = 0; l < n; ++l)
> if (k[2][l])
> ...
> }
Theoretically the complexity is changing from L1(bbs) to
L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs),
so fill_always_executed_in_1's execution time is supposed to be increase from
O(n) to O(n2)? The time should depend on loop depth and bb counts. I also
drafted a
test case has 73-depth loop function with 25 no-ipa function copies each
compiled
in lim2 and lim4 dependently. Total execution time of
fill_always_executed_in_1 is
increased from 32ms to 58ms, almost doubled but not quadratic?
It seems reasonable to see compiling time getting longer since most bbs are
checked
more but a MUST to ensure early break correctly in every loop level...
Though loop nodes could be huge, loop depth will never be so large in actual
code?
>
> I suspect you'll see quadratic behavior with your patch. You
> should be at least able to preserve a check like
>
> /* Do not process not always executed subloops to avoid
> quadratic behavior. */
> if (bb->loop_father->header == bb
> && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> break;
>
> which is of course not optimistic for cases like
>
> for (..)
> {
> if (cond)
> for (..)
> x = 1; // this is always executed if the inner loop is finite
> }
>
> but we need to have an eye on the complexity of this function. I would
> have suggested to do greedy visiting of the loop header successors,
> processing further blocks if all entries (ignoring backedges) are
> processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist
> is empty proceed to inner loops as the current code does. For
> bookkeeping simply keep a to-visit-incoming-edges counter per BB.
> Pseudo-code:
>
> bitmap_set_bit (worklist, loop-header-bb);
> while (!bitmap_empty_p (worklist))
> {
> bb = pop (worklist);
Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN?
> SET_ALWAYS_EXECUTED_IN (bb, loop);
> if (bitmap_bit_p (contains_call, bb->index))
> continue;
> FOR_EACH_EDGE (e, ei, bb->succs)
> {
> if (!flow_bb_inside_loop_p (loop, e->dest))
> continue;
> if (incoming_count[e->dest->index]-- == 0)
> push (worklist, e->dest);
> }
> }
Sorry I don't fully understand your algorithm. worklist should be
auto_vec<basic_block> don't support bitmap operations? Is incoming_count
the bb's preds count, why need retain it since it it decreased to 0?
>
> iterate over inner loops (incoming_count can be retained,
> we just force the inner loop header onto the worklist).
Is this same to ?
for (loop = loop->inner; loop; loop = loop->next)
fill_always_executed_in_1 (loop, contains_call)
>
> that would avoid the quadraticness with respect to
> get_loop_body_in_dom_order and also fix the dominator issue
> mentioned above.
>
> Richard.
--
Thanks,
Xionghu