On Mon, 9 Jan 2017, Jakub Jelinek wrote:

> On Sat, Jan 07, 2017 at 12:46:26PM +0100, Richard Biener wrote:
> > >The following patch tweaks tree-if-conv.c so that when it wants to
> > >version
> > >an outer loop, it actually transforms:
> > >         loop1
> > >           loop2
> > >into:
> > >         if (LOOP_VECTORIZED (1, 3))
> > >           {
> > >             loop1
> > >               loop2
> > >           }
> > >         else
> > >           loop3 (copy of loop1)
> > >             if (LOOP_VECTORIZED (4, 5))
> > >               loop4 (copy of loop2)
> > >             else
> > >               loop5 (copy of loop4)
> > 
> > Huh.  Why isn't the else case equal to the if case for the vectorizer?  
> > That is, we have the inner loop if-converted And thus for the if case 
> > either outer or inner loop vectorization should be possible.
> > 
> > So - why doesn't it work that way?
> 
> The patch is a consequence of the r242520 changes:
> http://gcc.gnu.org/ml/gcc-patches/2016-11/msg01541.html
> Previously, we had
>       loop1
>         loop2
> transformed into:
>       loop1
>         if (LOOP_VECTORIZED (2, 3))
>           loop2
>         else
>           loop3 (copy of loop2)
> (if actually we find if-conversion of loop2 desirable, there are masked
> loads/stores etc.).  loop2 is if-converted after the versioning.
> This works well if the inner loop is vectorized, doesn't work at all
> (prevents it) outer loop vectorization.
> Now, with the r242520, it is transformed into:
>       if (LOOP_VECTORIZED (1, 3))
>         {
>           loop1
>             loop2
>         }
>       else
>         loop3 (copy of loop1)
>           loop4 (copy of loop2)
> if if-conversion thinks outer loop vectorization might be successful.
> In this case, loop2 is if-converted.  This works well if the outer loop
> versioning is subsequently successful, doesn't work at all if it is
> unsuccessful (loop2 itself isn't LOOP_VECTORIZED guarded, so when we are
> vectorizing, we use loop2 itself as its scalar loop (so it will contain
> MASK_LOAD/MASK_STORE etc. that we then can't expand; also, as loop1 isn't
> vectorized, LOOP_VECTORIZED (1, 3) is folded into false and thus we
> effectively are vectorizing loop2 in dead code, loop3/loop4 will be used
> instead (loop3 is marked as dont_vectorize, so we don't try to vectorize it,
> loop4 isn't, so might be vectorized, but only if no if-conversion is needed
> for it (but tree-if-conversion determined it is needed)).
> With my patch, we have instead:
>             if (LOOP_VECTORIZED (1, 3))
>               {
>                 loop1
>                   loop2
>               }
>             else
>               loop3 (copy of loop1)
>                 if (LOOP_VECTORIZED (4, 5))
>                   loop4 (copy of loop2)
>                 else
>                   loop5 (copy of loop4)
> loop2 and loop4 are if-converted, so either outer loop vectorization of
> loop1 is successful, then we use loop1/loop2 as the vectorized loop
> and loop3/loop5 as the corresponding scalar loop, or it is unsuccessful,
> then we use non-vectorized loop3, either with successfully vectorized
> loop4 as inner loop (loop5 is corresponding scalar_loop, for epilogues,
> versioning for alignment etc.), or we fail to vectorize anything and
> end up with scalar loop3/loop5.

But that causes even more versioning (plus redundant if-conversion).

> Without the patch I've posted, there are some remaining options, but those
> will mean big amount of work in the loop manipulation code etc.
> 
> One option is to keep r242520 in, then the problem is that:
> 1) we would need to defer folding LOOP_VECTORIZED (1, 3) into false when
> outer loop vectorization failed, if there is still possible inner loop
> vectorization (not that difficult)

Yeah, that's something r242520 missed to address it seems.  We've
expected this works as expected...  heh.  Testsuite coverage is low
here it seems.

> 2) we'd need to use loop4 as the scalar_loop for the vectorization of
> loop2, but that loop is not adjacent to the vectorized loop, so we'd need
> to somehow transform all the SSA_NAMEs that might be affected by that
> different placement (as if all the loop3 PHIs were loop1 PHIs instead,
> and deal with the SSA_NAMEs set in loop4 and used outside of loop4 as
> if those were those in loop2 instead); this is the hard part I'm not really
> enthusiastic to write

We use the special scalar_loop always if we have the loop_vectorized
guard, right?  Which is of course good.  And required for masked 
loads/stores.

I see how this looks somewhat unfortunate.

> Another option is to revert r242520, and then do something for the outer loop
> vectorization.  Right now we expect certain fixed form (5 basic blocks in
> the outer loop, lots of assumptions about the cfg of that, dunno where
> everywhere it is hardcoded).  We'd need to allow also 7+ basic block form,
> where one of the extra loops is just if LOOP_VECTORIZED (x, y), then there
> is if-converted single basic block loop x, and then perhaps many basic
> blocks loop y.  Lots of the vectorization code expects ->inner to be THE
> inner loop, that would no longer be the case, it would be either the scalar
> or vector loop, etc.

Yeah, this is why we went the r242520 route...

> To me my patch at least for GCC7 looks like far less work than both other
> option would require.  I know it has the drawback that compared to the other
> options there is more loop copying that - for this regard, the option to
> revert r242520 and deal with it would be the most effective, only a single
> loop is versioned, with r242520 and without my patch it is 2 loops versioned
> and with my patch 3 loops versioned.

Certainly handling r242520 in a different way looks best then (w/o
r242520 the followup to always version loops in if-conversion shows
a lot of vect.exp regressions).

> There is one thing my patch doesn't do but should for efficiency, if loop1
> (outer loop) is not successfully outer-loop vectorized, then we should mark
> loop2 (its inner loop) as dont_vectorize if the outer loop has been
> LOOP_VECTORIZED guarded.  Then the gcc.dg/gomp/pr68128-1.c change
> wouldn't be actually needed.

Yes.

Are you willing to try re-doing r242520?

Thanks,
Richard.

Reply via email to