On Mon, 9 Jan 2017, Jakub Jelinek wrote:

> On Mon, Jan 09, 2017 at 10:08:24AM +0100, Richard Biener wrote:
> > > if if-conversion thinks outer loop vectorization might be successful.
> > > In this case, loop2 is if-converted.  This works well if the outer loop
> > > versioning is subsequently successful, doesn't work at all if it is
> > > unsuccessful (loop2 itself isn't LOOP_VECTORIZED guarded, so when we are
> > > vectorizing, we use loop2 itself as its scalar loop (so it will contain
> > > MASK_LOAD/MASK_STORE etc. that we then can't expand; also, as loop1 isn't
> > > vectorized, LOOP_VECTORIZED (1, 3) is folded into false and thus we
> > > effectively are vectorizing loop2 in dead code, loop3/loop4 will be used
> > > instead (loop3 is marked as dont_vectorize, so we don't try to vectorize 
> > > it,
> > > loop4 isn't, so might be vectorized, but only if no if-conversion is 
> > > needed
> > > for it (but tree-if-conversion determined it is needed)).
> > > With my patch, we have instead:
> > >         if (LOOP_VECTORIZED (1, 3))
> > >           {
> > >             loop1
> > >               loop2
> > >           }
> > >         else
> > >           loop3 (copy of loop1)
> > >             if (LOOP_VECTORIZED (4, 5))
> > >               loop4 (copy of loop2)
> > >             else
> > >               loop5 (copy of loop4)
> > > loop2 and loop4 are if-converted, so either outer loop vectorization of
> > > loop1 is successful, then we use loop1/loop2 as the vectorized loop
> > > and loop3/loop5 as the corresponding scalar loop, or it is unsuccessful,
> > > then we use non-vectorized loop3, either with successfully vectorized
> > > loop4 as inner loop (loop5 is corresponding scalar_loop, for epilogues,
> > > versioning for alignment etc.), or we fail to vectorize anything and
> > > end up with scalar loop3/loop5.
> > 
> > But that causes even more versioning (plus redundant if-conversion).
> 
> Yes, one more loop (i.e. 2->3 loops versioned).
> 
> > > One option is to keep r242520 in, then the problem is that:
> > > 1) we would need to defer folding LOOP_VECTORIZED (1, 3) into false when
> > > outer loop vectorization failed, if there is still possible inner loop
> > > vectorization (not that difficult)
> > 
> > Yeah, that's something r242520 missed to address it seems.  We've
> > expected this works as expected...  heh.  Testsuite coverage is low
> > here it seems.
> 
> The insufficient testsuite coverage is what is the bigest problem I think.
> As I said, this part isn't that hard and I could do it.
> 
> > > 2) we'd need to use loop4 as the scalar_loop for the vectorization of
> > > loop2, but that loop is not adjacent to the vectorized loop, so we'd need
> > > to somehow transform all the SSA_NAMEs that might be affected by that
> > > different placement (as if all the loop3 PHIs were loop1 PHIs instead,
> > > and deal with the SSA_NAMEs set in loop4 and used outside of loop4 as
> > > if those were those in loop2 instead); this is the hard part I'm not 
> > > really
> > > enthusiastic to write
> > 
> > We use the special scalar_loop always if we have the loop_vectorized
> > guard, right?  Which is of course good.  And required for masked 
> > loads/stores.
> > 
> > I see how this looks somewhat unfortunate.
> 
> Yes, the scalar loop is used in many places, every time we need something
> that will not be vectorized, we use that (because, we don't have and don't
> want scalar MASK_LOAD/MASK_STORE, or without -ftree-loop-if-convert now also
> the COND_EXPRs all around).
> 
> > Certainly handling r242520 in a different way looks best then (w/o
> > r242520 the followup to always version loops in if-conversion shows
> > a lot of vect.exp regressions).
> > 
> > > There is one thing my patch doesn't do but should for efficiency, if loop1
> > > (outer loop) is not successfully outer-loop vectorized, then we should 
> > > mark
> > > loop2 (its inner loop) as dont_vectorize if the outer loop has been
> > > LOOP_VECTORIZED guarded.  Then the gcc.dg/gomp/pr68128-1.c change
> > > wouldn't be actually needed.

(*)

> > Yes.
> > 
> > Are you willing to try re-doing r242520?
> 
> Given the amount of time it took me to debug even this version of the patch
> (wasted more than a day on the tree-vect-loop-manip.c change, the ICEs on
> pr71077 testcase looked very cryptic), I'm afraid
> I don't know the SSA_NAME renaming/vect manipulation good enough not to
> spend a week or more on that and I probably need to spend that time on other
> PRs instead, we are over 50 P1-P3s behind e.g. GCC5 schedule (comparing
> from Jan 8th GCC5 status report).
> If you think you'd be able to handle it faster or if somebody else (Bill
> as the author of r242520) is willing to handle this part, I can help with
> the tree-vectorizer.c part.
> I think the minimum number of functions that need changing is
> slpeel_tree_duplicate_loop_to_edge_cfg
> vect_loop_versioning
> If the tree-vectorizer.c part supplies loop4 as the LOOP_VINFO_SCALAR_LOOP
> for loop2, then the check whether this needs to extra more complicated
> handling could be whether the vectorized loop's outer loop is not equal to
> scalar_loop's outer loop.

Ok, I don't have too many spare cycles either so can you fix (*)?  Then
we can go with the extra versionings for GCC 7 for the moment and if any
of us has enough time to revisit this soon we can.

Richard.

Reply via email to