On Tue, 24 Aug 2021, Xionghu Luo wrote:

> 
> 
> On 2021/8/19 20:11, Richard Biener wrote:
> >> -  class loop *inn_loop = loop;
> >>   
> >>     if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
> >>       {
> >> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, 
> >> sbitmap contains_call)
> >>         to disprove this if possible).  */
> >>      if (bb->flags & BB_IRREDUCIBLE_LOOP)
> >>        break;
> >> -
> >> -    if (!flow_bb_inside_loop_p (inn_loop, bb))
> >> -      break;
> >> -
> >> -    if (bb->loop_father->header == bb)
> >> -      {
> >> -        if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >> -          break;
> >> -
> >> -        /* In a loop that is always entered we may proceed anyway.
> >> -           But record that we entered it and stop once we leave it.  */
> >> -        inn_loop = bb->loop_father;
> >> -      }
> >>    }
> >>   
> >>         while (1)
> > I'm not sure this will work correct (I'm not sure how the existing
> > code makes it so either...).  That said, I can't poke any hole
> > into the change.  What I see is that definitely
> > 
> >            if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >              last = bb;
> > 
> >            if (bitmap_bit_p (contains_call, bb->index))
> >              break;
> > 
> > doesn't work reliably since the DOM ordering will process blocks
> > A B and C in random order for
> > 
> >    for (;;)
> >     {
> >        if (cond)
> >          {
> >            A: foo ();
> >          }
> >        else B:;
> >        C:;
> >     }
> > 
> > and thus we can end up setting 'last' to C_before_  processing
> > 'A' and thus arriving at the call foo () ...
> > 
> > get_loop_body_in_dom_order does some "special sauce" but not
> > to address the above problem - but it might be that a subtle
> > issue like the above is the reason for the inner loop handling.
> > The inner loop block order does_not_  adhere to this "special sauce",
> > that is - the "Additionally, if a basic block s dominates
> > the latch, then only blocks dominated by s are be after it."
> > guarantee holds for the outer loop latch, not for the inner.
> > 
> > Digging into the history of fill_always_executed_in_1 doesn't
> > reveal anything - the inner loop handling has been present
> > since introduction by Zdenek - but usually Zdenek has a reason
> > for doing things as he does;)
> 
> Yes, this is really complicated usage, thanks for point it out. :)
> I constructed two cases to verify this with inner loop includes "If A; else 
> B; C". 
> Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks
> whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch,
> C is postponed, the access order is ABC, 'last' won’t be set to C if A or B 
> contains call;

But it depends on the order of visiting ABC and that's hard to put into
a testcase since it depends on the order of edges and the processing
of the dominance computation.  ABC are simply unordered with respect
to a dominator walk.

> Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop,
> the access order is CAB, but 'last' also won’t be updated to C in 
> fill_always_executed_in_1
> since there is also dominate check, then if A or B contains call, it could 
> break
> successfully. 
> 
> C won't be set to ALWAYS EXECUTED for both circumstance.
> 
> > 
> > Note it might be simply a measure against quadratic complexity,
> > esp. since with your patch we also dive into not always executed
> > subloops as you remove the
> > 
> >                if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >                  break;
> > 
> > check.  I suggest to evaluate behavior of the patch on a testcase
> > like
> > 
> > void foo (int n, int **k)
> > {
> >    for (int i = 0; i < n; ++i)
> >      if (k[0][i])
> >        for (int j = 0; j < n; ++j)
> >          if (k[1][j])
> >            for (int l = 0; l < n; ++l)
> >              if (k[2][l])
> >                ...
> > }
> 
> Theoretically the complexity is changing from L1(bbs) to 
> L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs),
> so fill_always_executed_in_1's execution time is supposed to be increase from
> O(n) to O(n2)?  The time should depend on loop depth and bb counts.   I also 
> drafted a
> test case has 73-depth loop function with 25 no-ipa function copies each 
> compiled
> in lim2 and lim4 dependently.  Total execution time of 
> fill_always_executed_in_1 is
> increased from 32ms to 58ms, almost doubled but not quadratic?

It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still
O(n^2).

> It seems reasonable to see compiling time getting longer since most bbs are 
> checked
> more but a MUST to ensure early break correctly in every loop level... 
> Though loop nodes could be huge, loop depth will never be so large in actual 
> code?

The "in practice" argument is almost always defeated by automatic
program generators ;)

> >  
> > I suspect you'll see quadratic behavior with your patch.  You
> > should be at least able to preserve a check like
> > 
> >            /* Do not process not always executed subloops to avoid
> >               quadratic behavior.  */
> >            if (bb->loop_father->header == bb
> >                && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >              break;
> > 
> > which is of course not optimistic for cases like
> > 
> >    for (..)
> >     {
> >       if (cond)
> >         for (..)
> >       x = 1; // this is always executed if the inner loop is finite
> >     }
> > 
> > but we need to have an eye on the complexity of this function.  I would
> > have suggested to do greedy visiting of the loop header successors,
> > processing further blocks if all entries (ignoring backedges) are
> > processed, setting SET_ALWAYS_EXECUTED_IN.  When the worklist
> > is empty proceed to inner loops as the current code does.  For
> > bookkeeping simply keep a to-visit-incoming-edges counter per BB.
> > Pseudo-code:
> > 
> >    bitmap_set_bit (worklist, loop-header-bb);
> >    while (!bitmap_empty_p (worklist))
> >      {
> >        bb = pop (worklist);
> 
> Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN?

Ah, sure.

> >        SET_ALWAYS_EXECUTED_IN (bb, loop);
> >        if (bitmap_bit_p (contains_call, bb->index))
> >          continue;
> >        FOR_EACH_EDGE (e, ei, bb->succs)
> >          {
> >            if (!flow_bb_inside_loop_p (loop, e->dest))
> >              continue;
> >            if (incoming_count[e->dest->index]-- == 0)
> >              push (worklist, e->dest);
> >          }
> >      }
> 
> Sorry I don't fully understand your algorithm.  worklist should be
> auto_vec<basic_block> don't support bitmap operations?  Is incoming_count
> the bb's preds count, why need retain it since it it decreased to 0?

the worklist is a auto_bitmap, pop () would be 
bitmap_first_set_bit/clear_bit.  One could use a vector with the
caveat of eventually adding duplicate members to the worklist.
But as said, it's pseudo-code ;)

> >    
> >    iterate over inner loops (incoming_count can be retained,
> >    we just force the inner loop header onto the worklist).
> 
> Is this same to ?
> 
>   for (loop = loop->inner; loop; loop = loop->next)
>     fill_always_executed_in_1 (loop, contains_call)

Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which
should iterate over loops in PRE order.

Richard.

Reply via email to