RE: dom1 prevents vectorization via partial loop peeling?

Ajit Kumar Agarwal Tue, 28 Apr 2015 03:58:08 -0700


-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard 
Biener
Sent: Tuesday, April 28, 2015 4:12 PM
To: Jeff Law
Cc: Alan Lawrence; gcc@gcc.gnu.org
Subject: Re: dom1 prevents vectorization via partial loop peeling?


On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law <l...@redhat.com> wrote:
> On 04/27/2015 10:12 AM, Alan Lawrence wrote:
>>
>>
>> After copyrename3, immediately prior to dom1, the loop body looks like:
>>
>>    <bb 2>:
>>
>>    <bb 3>:
>>    # i_11 = PHI <i_9(7), 0(2)>
>>    _5 = a[i_11];
>>    _6 = i_11 & _5;
>>    if (_6 != 0)
>>      goto <bb 4>;
>>    else
>>      goto <bb 5>;
>>
>>    <bb 4>:
>>
>>    <bb 5>:
>>    # m_2 = PHI <5(4), 4(3)>
>>    _7 = m_2 * _5;
>>    b[i_11] = _7;
>>    i_9 = i_11 + 1;
>>    if (i_9 != 32)
>>      goto <bb 7>;
>>    else
>>      goto <bb 6>;
>>
>>    <bb 7>:
>>    goto <bb 3>;
>>
>>    <bb 6>:
>>    return;
>>
>> dom1 then peeled part of the first loop iteration, producing:
>
> Yup.  The jump threading code realized that if we traverse the edge 
> 2->3, then we'll always traverse 3->5.  The net effect is like peeling 
> the first iteration because we copy bb3.  The original will be reached 
> via 7->3 (it, loop iterating), the copy via 2->3' and 3' will have its 
> conditional removed and will unconditionally transfer control to bb5.
>
>
>
> This is a known problem, but we really don't have any good heuristics 
> for when to do this vs when not to do it.
>
>
>>
>> In contrast, a slightly-different testcase:
>>
>> #define N 32
>>
>> int a[N];
>> int b[N];
>>
>> int foo ()
>> {
>>    for (int i = 0; i < N ; i++)
>>      {
>>        int cond = (a[i] & i) ? -1 : 0; // extra variable here
>>        int m = (cond) ? 5 : 4;
>>        b[i] = a[i] * m;
>>      }
>> }
>>
>> after copyrename3, just before dom1, is only slightly different:
>>
>>    <bb 2>:
>>
>>    <bb 3>:
>>    # i_15 = PHI <i_10(7), 0(2)>
>>    _6 = a[i_15];
>>    _7 = i_15 & _6;
>>    if (_7 != 0)
>>      goto <bb 4>;
>>    else
>>      goto <bb 6>;
>>
>>    <bb 4>:
>>    # m_3 = PHI <4(6), 5(3)>
>>    _8 = m_3 * _6;
>>    b[i_15] = _8;
>>    i_10 = i_15 + 1;
>>    if (i_10 != 32)
>>      goto <bb 7>;
>>    else
>>      goto <bb 5>;
>>
>>    <bb 7>:
>>    goto <bb 3>;
>>
>>    <bb 5>:
>>    return;
>>
>>    <bb 6>:
>>    goto <bb 4>;
>>
>> with bb6 being out-of-line at the end of the function, rather than 
>> bb4 falling through from just above bb5. However, this prevents dom1 
>> from doing the partial peeling, and dom1 only changes the "goto bb7" 
>> into a "goto bb3":
>
> I would have still expected it to thread 2->3, 3->6->4
>
>
>>
>> (1) dom1 should really, in the second case, perform the same partial 
>> peeling that it does in the first testcase, if that is what it thinks 
>> is desirable. (Of course, we might want to fix that only later, as 
>> ATM that'd take us backwards).
>
> Please a file a BZ.  It could be something simple, or we might be 
> hitting one of Zdenek's heuristics around keeping overall loop structure.
>
>>
>> Alternatively, maybe we don't want dom1 doing that sort of thing (?), 
>> but I'm inclined to think that if it's doing such optimizations, it's 
>> for a good reason ;) I guess there'll be other times where we 
>> *cannot* do partially peeling of later iterations...
>
> It's an open question -- we never reached any kind of conclusion when 
> it was last discussed with Zdenek.  I think the fundamental issue is 
> we can't really predict when threading the loop is going to interfere 
> with later optimizations or not.  The heuristics we have are marginal at best.
>
> The one thought we've never explored was re-rolling that first 
> iteration back into the loop in the vectorizer.

>>Well.  In this case we hit


  >>/* If one of the loop header's edge is an exit edge then do not
    >> apply if-conversion.  */
  >>FOR_EACH_EDGE (e, ei, loop->header->succs)
   >> if (loop_exit_edge_p (loop, e))
     >> return false;

>>which is simply because even after if-conversion we'll at least end up with a 
>>non-empty latch block which is what the vectorizer doesn't support.
>>DOM rotated the loop into this non-canonical form.  Running loop header 
>>copying again would probably undo this.

The creation of empty latches with the path-splitting approach where the back 
edge node can be copied to the predecessors and the 
Empty latch can be created with the path-splitting approach I have proposed. 
This will enable the above scenario of vectorization.

Thanks & Regards
Ajit

Richard.

>
> Jeff

RE: dom1 prevents vectorization via partial loop peeling?

Reply via email to