-----Original Message----- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard Biener Sent: Tuesday, April 28, 2015 4:12 PM To: Jeff Law Cc: Alan Lawrence; gcc@gcc.gnu.org Subject: Re: dom1 prevents vectorization via partial loop peeling?
On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law <l...@redhat.com> wrote: > On 04/27/2015 10:12 AM, Alan Lawrence wrote: >> >> >> After copyrename3, immediately prior to dom1, the loop body looks like: >> >> <bb 2>: >> >> <bb 3>: >> # i_11 = PHI <i_9(7), 0(2)> >> _5 = a[i_11]; >> _6 = i_11 & _5; >> if (_6 != 0) >> goto <bb 4>; >> else >> goto <bb 5>; >> >> <bb 4>: >> >> <bb 5>: >> # m_2 = PHI <5(4), 4(3)> >> _7 = m_2 * _5; >> b[i_11] = _7; >> i_9 = i_11 + 1; >> if (i_9 != 32) >> goto <bb 7>; >> else >> goto <bb 6>; >> >> <bb 7>: >> goto <bb 3>; >> >> <bb 6>: >> return; >> >> dom1 then peeled part of the first loop iteration, producing: > > Yup. The jump threading code realized that if we traverse the edge > 2->3, then we'll always traverse 3->5. The net effect is like peeling > the first iteration because we copy bb3. The original will be reached > via 7->3 (it, loop iterating), the copy via 2->3' and 3' will have its > conditional removed and will unconditionally transfer control to bb5. > > > > This is a known problem, but we really don't have any good heuristics > for when to do this vs when not to do it. > > >> >> In contrast, a slightly-different testcase: >> >> #define N 32 >> >> int a[N]; >> int b[N]; >> >> int foo () >> { >> for (int i = 0; i < N ; i++) >> { >> int cond = (a[i] & i) ? -1 : 0; // extra variable here >> int m = (cond) ? 5 : 4; >> b[i] = a[i] * m; >> } >> } >> >> after copyrename3, just before dom1, is only slightly different: >> >> <bb 2>: >> >> <bb 3>: >> # i_15 = PHI <i_10(7), 0(2)> >> _6 = a[i_15]; >> _7 = i_15 & _6; >> if (_7 != 0) >> goto <bb 4>; >> else >> goto <bb 6>; >> >> <bb 4>: >> # m_3 = PHI <4(6), 5(3)> >> _8 = m_3 * _6; >> b[i_15] = _8; >> i_10 = i_15 + 1; >> if (i_10 != 32) >> goto <bb 7>; >> else >> goto <bb 5>; >> >> <bb 7>: >> goto <bb 3>; >> >> <bb 5>: >> return; >> >> <bb 6>: >> goto <bb 4>; >> >> with bb6 being out-of-line at the end of the function, rather than >> bb4 falling through from just above bb5. However, this prevents dom1 >> from doing the partial peeling, and dom1 only changes the "goto bb7" >> into a "goto bb3": > > I would have still expected it to thread 2->3, 3->6->4 > > >> >> (1) dom1 should really, in the second case, perform the same partial >> peeling that it does in the first testcase, if that is what it thinks >> is desirable. (Of course, we might want to fix that only later, as >> ATM that'd take us backwards). > > Please a file a BZ. It could be something simple, or we might be > hitting one of Zdenek's heuristics around keeping overall loop structure. > >> >> Alternatively, maybe we don't want dom1 doing that sort of thing (?), >> but I'm inclined to think that if it's doing such optimizations, it's >> for a good reason ;) I guess there'll be other times where we >> *cannot* do partially peeling of later iterations... > > It's an open question -- we never reached any kind of conclusion when > it was last discussed with Zdenek. I think the fundamental issue is > we can't really predict when threading the loop is going to interfere > with later optimizations or not. The heuristics we have are marginal at best. > > The one thought we've never explored was re-rolling that first > iteration back into the loop in the vectorizer. >>Well. In this case we hit >>/* If one of the loop header's edge is an exit edge then do not >> apply if-conversion. */ >>FOR_EACH_EDGE (e, ei, loop->header->succs) >> if (loop_exit_edge_p (loop, e)) >> return false; >>which is simply because even after if-conversion we'll at least end up with a >>non-empty latch block which is what the vectorizer doesn't support. >>DOM rotated the loop into this non-canonical form. Running loop header >>copying again would probably undo this. The creation of empty latches with the path-splitting approach where the back edge node can be copied to the predecessors and the Empty latch can be created with the path-splitting approach I have proposed. This will enable the above scenario of vectorization. Thanks & Regards Ajit Richard. > > Jeff