On Mon, 9 Jan 2017, Jakub Jelinek wrote: > On Sat, Jan 07, 2017 at 12:46:26PM +0100, Richard Biener wrote: > > >The following patch tweaks tree-if-conv.c so that when it wants to > > >version > > >an outer loop, it actually transforms: > > > loop1 > > > loop2 > > >into: > > > if (LOOP_VECTORIZED (1, 3)) > > > { > > > loop1 > > > loop2 > > > } > > > else > > > loop3 (copy of loop1) > > > if (LOOP_VECTORIZED (4, 5)) > > > loop4 (copy of loop2) > > > else > > > loop5 (copy of loop4) > > > > Huh. Why isn't the else case equal to the if case for the vectorizer? > > That is, we have the inner loop if-converted And thus for the if case > > either outer or inner loop vectorization should be possible. > > > > So - why doesn't it work that way? > > The patch is a consequence of the r242520 changes: > http://gcc.gnu.org/ml/gcc-patches/2016-11/msg01541.html > Previously, we had > loop1 > loop2 > transformed into: > loop1 > if (LOOP_VECTORIZED (2, 3)) > loop2 > else > loop3 (copy of loop2) > (if actually we find if-conversion of loop2 desirable, there are masked > loads/stores etc.). loop2 is if-converted after the versioning. > This works well if the inner loop is vectorized, doesn't work at all > (prevents it) outer loop vectorization. > Now, with the r242520, it is transformed into: > if (LOOP_VECTORIZED (1, 3)) > { > loop1 > loop2 > } > else > loop3 (copy of loop1) > loop4 (copy of loop2) > if if-conversion thinks outer loop vectorization might be successful. > In this case, loop2 is if-converted. This works well if the outer loop > versioning is subsequently successful, doesn't work at all if it is > unsuccessful (loop2 itself isn't LOOP_VECTORIZED guarded, so when we are > vectorizing, we use loop2 itself as its scalar loop (so it will contain > MASK_LOAD/MASK_STORE etc. that we then can't expand; also, as loop1 isn't > vectorized, LOOP_VECTORIZED (1, 3) is folded into false and thus we > effectively are vectorizing loop2 in dead code, loop3/loop4 will be used > instead (loop3 is marked as dont_vectorize, so we don't try to vectorize it, > loop4 isn't, so might be vectorized, but only if no if-conversion is needed > for it (but tree-if-conversion determined it is needed)). > With my patch, we have instead: > if (LOOP_VECTORIZED (1, 3)) > { > loop1 > loop2 > } > else > loop3 (copy of loop1) > if (LOOP_VECTORIZED (4, 5)) > loop4 (copy of loop2) > else > loop5 (copy of loop4) > loop2 and loop4 are if-converted, so either outer loop vectorization of > loop1 is successful, then we use loop1/loop2 as the vectorized loop > and loop3/loop5 as the corresponding scalar loop, or it is unsuccessful, > then we use non-vectorized loop3, either with successfully vectorized > loop4 as inner loop (loop5 is corresponding scalar_loop, for epilogues, > versioning for alignment etc.), or we fail to vectorize anything and > end up with scalar loop3/loop5.
But that causes even more versioning (plus redundant if-conversion). > Without the patch I've posted, there are some remaining options, but those > will mean big amount of work in the loop manipulation code etc. > > One option is to keep r242520 in, then the problem is that: > 1) we would need to defer folding LOOP_VECTORIZED (1, 3) into false when > outer loop vectorization failed, if there is still possible inner loop > vectorization (not that difficult) Yeah, that's something r242520 missed to address it seems. We've expected this works as expected... heh. Testsuite coverage is low here it seems. > 2) we'd need to use loop4 as the scalar_loop for the vectorization of > loop2, but that loop is not adjacent to the vectorized loop, so we'd need > to somehow transform all the SSA_NAMEs that might be affected by that > different placement (as if all the loop3 PHIs were loop1 PHIs instead, > and deal with the SSA_NAMEs set in loop4 and used outside of loop4 as > if those were those in loop2 instead); this is the hard part I'm not really > enthusiastic to write We use the special scalar_loop always if we have the loop_vectorized guard, right? Which is of course good. And required for masked loads/stores. I see how this looks somewhat unfortunate. > Another option is to revert r242520, and then do something for the outer loop > vectorization. Right now we expect certain fixed form (5 basic blocks in > the outer loop, lots of assumptions about the cfg of that, dunno where > everywhere it is hardcoded). We'd need to allow also 7+ basic block form, > where one of the extra loops is just if LOOP_VECTORIZED (x, y), then there > is if-converted single basic block loop x, and then perhaps many basic > blocks loop y. Lots of the vectorization code expects ->inner to be THE > inner loop, that would no longer be the case, it would be either the scalar > or vector loop, etc. Yeah, this is why we went the r242520 route... > To me my patch at least for GCC7 looks like far less work than both other > option would require. I know it has the drawback that compared to the other > options there is more loop copying that - for this regard, the option to > revert r242520 and deal with it would be the most effective, only a single > loop is versioned, with r242520 and without my patch it is 2 loops versioned > and with my patch 3 loops versioned. Certainly handling r242520 in a different way looks best then (w/o r242520 the followup to always version loops in if-conversion shows a lot of vect.exp regressions). > There is one thing my patch doesn't do but should for efficiency, if loop1 > (outer loop) is not successfully outer-loop vectorized, then we should mark > loop2 (its inner loop) as dont_vectorize if the outer loop has been > LOOP_VECTORIZED guarded. Then the gcc.dg/gomp/pr68128-1.c change > wouldn't be actually needed. Yes. Are you willing to try re-doing r242520? Thanks, Richard.