Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1

Xionghu Luo via Gcc-patches Tue, 24 Aug 2021 00:45:48 -0700

On 2021/8/19 20:11, Richard Biener wrote:
>> -  class loop *inn_loop = loop;
>>   
>>     if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
>>       {
>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap 
>> contains_call)
>>           to disprove this if possible).  */
>>        if (bb->flags & BB_IRREDUCIBLE_LOOP)
>>          break;
>> -
>> -      if (!flow_bb_inside_loop_p (inn_loop, bb))
>> -        break;
>> -
>> -      if (bb->loop_father->header == bb)
>> -        {
>> -          if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>> -            break;
>> -
>> -          /* In a loop that is always entered we may proceed anyway.
>> -             But record that we entered it and stop once we leave it.  */
>> -          inn_loop = bb->loop_father;
>> -        }
>>      }
>>   
>>         while (1)
> I'm not sure this will work correct (I'm not sure how the existing
> code makes it so either...).  That said, I can't poke any hole
> into the change.  What I see is that definitely
> 
>            if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>              last = bb;
> 
>            if (bitmap_bit_p (contains_call, bb->index))
>              break;
> 
> doesn't work reliably since the DOM ordering will process blocks
> A B and C in random order for
> 
>    for (;;)
>     {
>        if (cond)
>          {
>            A: foo ();
>          }
>        else B:;
>        C:;
>     }
> 
> and thus we can end up setting 'last' to C_before_  processing
> 'A' and thus arriving at the call foo () ...
> 
> get_loop_body_in_dom_order does some "special sauce" but not
> to address the above problem - but it might be that a subtle
> issue like the above is the reason for the inner loop handling.
> The inner loop block order does_not_  adhere to this "special sauce",
> that is - the "Additionally, if a basic block s dominates
> the latch, then only blocks dominated by s are be after it."
> guarantee holds for the outer loop latch, not for the inner.
> 
> Digging into the history of fill_always_executed_in_1 doesn't
> reveal anything - the inner loop handling has been present
> since introduction by Zdenek - but usually Zdenek has a reason
> for doing things as he does;)

Yes, this is really complicated usage, thanks for point it out. :)
I constructed two cases to verify this with inner loop includes "If A; else B; 
C". 
Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks
whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch,
C is postponed, the access order is ABC, 'last' won’t be set to C if A or B 
contains call;

Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop,
the access order is CAB, but 'last' also won’t be updated to C in 
fill_always_executed_in_1
since there is also dominate check, then if A or B contains call, it could break
successfully. 

C won't be set to ALWAYS EXECUTED for both circumstance.

> 
> Note it might be simply a measure against quadratic complexity,
> esp. since with your patch we also dive into not always executed
> subloops as you remove the
> 
>                if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>                  break;
> 
> check.  I suggest to evaluate behavior of the patch on a testcase
> like
> 
> void foo (int n, int **k)
> {
>    for (int i = 0; i < n; ++i)
>      if (k[0][i])
>        for (int j = 0; j < n; ++j)
>          if (k[1][j])
>            for (int l = 0; l < n; ++l)
>              if (k[2][l])
>                ...
> }

Theoretically the complexity is changing from L1(bbs) to 
L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs),
so fill_always_executed_in_1's execution time is supposed to be increase from
O(n) to O(n2)?  The time should depend on loop depth and bb counts.   I also 
drafted a
test case has 73-depth loop function with 25 no-ipa function copies each 
compiled
in lim2 and lim4 dependently.  Total execution time of 
fill_always_executed_in_1 is
increased from 32ms to 58ms, almost doubled but not quadratic?

It seems reasonable to see compiling time getting longer since most bbs are 
checked
more but a MUST to ensure early break correctly in every loop level... 
Though loop nodes could be huge, loop depth will never be so large in actual 
code?

>  
> I suspect you'll see quadratic behavior with your patch.  You
> should be at least able to preserve a check like
> 
>            /* Do not process not always executed subloops to avoid
>               quadratic behavior.  */
>            if (bb->loop_father->header == bb
>                && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>              break;
> 
> which is of course not optimistic for cases like
> 
>    for (..)
>     {
>       if (cond)
>         for (..)
>       x = 1; // this is always executed if the inner loop is finite
>     }
> 
> but we need to have an eye on the complexity of this function.  I would
> have suggested to do greedy visiting of the loop header successors,
> processing further blocks if all entries (ignoring backedges) are
> processed, setting SET_ALWAYS_EXECUTED_IN.  When the worklist
> is empty proceed to inner loops as the current code does.  For
> bookkeeping simply keep a to-visit-incoming-edges counter per BB.
> Pseudo-code:
> 
>    bitmap_set_bit (worklist, loop-header-bb);
>    while (!bitmap_empty_p (worklist))
>      {
>        bb = pop (worklist);

Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN?

>        SET_ALWAYS_EXECUTED_IN (bb, loop);
>        if (bitmap_bit_p (contains_call, bb->index))
>          continue;
>        FOR_EACH_EDGE (e, ei, bb->succs)
>          {
>            if (!flow_bb_inside_loop_p (loop, e->dest))
>              continue;
>            if (incoming_count[e->dest->index]-- == 0)
>              push (worklist, e->dest);
>          }
>      }

Sorry I don't fully understand your algorithm.  worklist should be
auto_vec<basic_block> don't support bitmap operations?  Is incoming_count
the bb's preds count, why need retain it since it it decreased to 0?

>    
>    iterate over inner loops (incoming_count can be retained,
>    we just force the inner loop header onto the worklist).

Is this same to ?

  for (loop = loop->inner; loop; loop = loop->next)
    fill_always_executed_in_1 (loop, contains_call)

> 
> that would avoid the quadraticness with respect to
> get_loop_body_in_dom_order and also fix the dominator issue
> mentioned above.
> 
> Richard.

-- 
Thanks,
Xionghu
Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1

Reply via email to