On Mon, 9 Jan 2017, Jakub Jelinek wrote: > On Mon, Jan 09, 2017 at 10:08:24AM +0100, Richard Biener wrote: > > > if if-conversion thinks outer loop vectorization might be successful. > > > In this case, loop2 is if-converted. This works well if the outer loop > > > versioning is subsequently successful, doesn't work at all if it is > > > unsuccessful (loop2 itself isn't LOOP_VECTORIZED guarded, so when we are > > > vectorizing, we use loop2 itself as its scalar loop (so it will contain > > > MASK_LOAD/MASK_STORE etc. that we then can't expand; also, as loop1 isn't > > > vectorized, LOOP_VECTORIZED (1, 3) is folded into false and thus we > > > effectively are vectorizing loop2 in dead code, loop3/loop4 will be used > > > instead (loop3 is marked as dont_vectorize, so we don't try to vectorize > > > it, > > > loop4 isn't, so might be vectorized, but only if no if-conversion is > > > needed > > > for it (but tree-if-conversion determined it is needed)). > > > With my patch, we have instead: > > > if (LOOP_VECTORIZED (1, 3)) > > > { > > > loop1 > > > loop2 > > > } > > > else > > > loop3 (copy of loop1) > > > if (LOOP_VECTORIZED (4, 5)) > > > loop4 (copy of loop2) > > > else > > > loop5 (copy of loop4) > > > loop2 and loop4 are if-converted, so either outer loop vectorization of > > > loop1 is successful, then we use loop1/loop2 as the vectorized loop > > > and loop3/loop5 as the corresponding scalar loop, or it is unsuccessful, > > > then we use non-vectorized loop3, either with successfully vectorized > > > loop4 as inner loop (loop5 is corresponding scalar_loop, for epilogues, > > > versioning for alignment etc.), or we fail to vectorize anything and > > > end up with scalar loop3/loop5. > > > > But that causes even more versioning (plus redundant if-conversion). > > Yes, one more loop (i.e. 2->3 loops versioned). > > > > One option is to keep r242520 in, then the problem is that: > > > 1) we would need to defer folding LOOP_VECTORIZED (1, 3) into false when > > > outer loop vectorization failed, if there is still possible inner loop > > > vectorization (not that difficult) > > > > Yeah, that's something r242520 missed to address it seems. We've > > expected this works as expected... heh. Testsuite coverage is low > > here it seems. > > The insufficient testsuite coverage is what is the bigest problem I think. > As I said, this part isn't that hard and I could do it. > > > > 2) we'd need to use loop4 as the scalar_loop for the vectorization of > > > loop2, but that loop is not adjacent to the vectorized loop, so we'd need > > > to somehow transform all the SSA_NAMEs that might be affected by that > > > different placement (as if all the loop3 PHIs were loop1 PHIs instead, > > > and deal with the SSA_NAMEs set in loop4 and used outside of loop4 as > > > if those were those in loop2 instead); this is the hard part I'm not > > > really > > > enthusiastic to write > > > > We use the special scalar_loop always if we have the loop_vectorized > > guard, right? Which is of course good. And required for masked > > loads/stores. > > > > I see how this looks somewhat unfortunate. > > Yes, the scalar loop is used in many places, every time we need something > that will not be vectorized, we use that (because, we don't have and don't > want scalar MASK_LOAD/MASK_STORE, or without -ftree-loop-if-convert now also > the COND_EXPRs all around). > > > Certainly handling r242520 in a different way looks best then (w/o > > r242520 the followup to always version loops in if-conversion shows > > a lot of vect.exp regressions). > > > > > There is one thing my patch doesn't do but should for efficiency, if loop1 > > > (outer loop) is not successfully outer-loop vectorized, then we should > > > mark > > > loop2 (its inner loop) as dont_vectorize if the outer loop has been > > > LOOP_VECTORIZED guarded. Then the gcc.dg/gomp/pr68128-1.c change > > > wouldn't be actually needed.
(*) > > Yes. > > > > Are you willing to try re-doing r242520? > > Given the amount of time it took me to debug even this version of the patch > (wasted more than a day on the tree-vect-loop-manip.c change, the ICEs on > pr71077 testcase looked very cryptic), I'm afraid > I don't know the SSA_NAME renaming/vect manipulation good enough not to > spend a week or more on that and I probably need to spend that time on other > PRs instead, we are over 50 P1-P3s behind e.g. GCC5 schedule (comparing > from Jan 8th GCC5 status report). > If you think you'd be able to handle it faster or if somebody else (Bill > as the author of r242520) is willing to handle this part, I can help with > the tree-vectorizer.c part. > I think the minimum number of functions that need changing is > slpeel_tree_duplicate_loop_to_edge_cfg > vect_loop_versioning > If the tree-vectorizer.c part supplies loop4 as the LOOP_VINFO_SCALAR_LOOP > for loop2, then the check whether this needs to extra more complicated > handling could be whether the vectorized loop's outer loop is not equal to > scalar_loop's outer loop. Ok, I don't have too many spare cycles either so can you fix (*)? Then we can go with the extra versionings for GCC 7 for the moment and if any of us has enough time to revisit this soon we can. Richard.