On Tue, Mar 26, 2019 at 1:56 AM Richard Sandiford <richard.sandif...@arm.com> wrote: > > Richard Biener <richard.guent...@gmail.com> writes: > > On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng <amker.ch...@gmail.com> wrote: > >> > >> On Thu, Mar 21, 2019 at 8:24 PM Richard Biener > >> <richard.guent...@gmail.com> wrote: > >> > > >> > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener > >> > <richard.guent...@gmail.com> wrote: > >> > > > >> > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.ch...@gmail.com> > >> > > wrote: > >> > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener > >> > > > <richard.guent...@gmail.com> wrote: > >> > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.ch...@gmail.com> > >> > > >> wrote: > >> > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng > >> > > >>> <amker.ch...@gmail.com> wrote: > >> > > >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener > >> > > >>>> <richard.guent...@gmail.com> wrote: > >> > > >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <bin.ch...@arm.com> > >> > > >>>>> wrote: > >> > > >>>>>> Hi, > >> > > >>>>>> As explained in the PR, given below test case: > >> > > >>>>>> int a[8][10] = { [2][5] = 4 }, c; > >> > > >>>>>> > >> > > >>>>>> int > >> > > >>>>>> main () > >> > > >>>>>> { > >> > > >>>>>> short b; > >> > > >>>>>> int i, d; > >> > > >>>>>> for (b = 4; b >= 0; b--) > >> > > >>>>>> for (c = 0; c <= 6; c++) > >> > > >>>>>> a[c + 1][b + 2] = a[c][b + 1]; > >> > > >>>>>> for (i = 0; i < 8; i++) > >> > > >>>>>> for (d = 0; d < 10; d++) > >> > > >>>>>> if (a[i][d] != (i == 3 && d == 6) * 4) > >> > > >>>>>> __builtin_abort (); > >> > > >>>>>> return 0; > >> > > >>>>>> > >> > > >>>>>> the loop nest is illegal for vectorization without reversing > >> > > >>>>>> inner loop. The issue > >> > > >>>>>> is in data dependence checking of vectorizer, I believe the > >> > > >>>>>> mentioned revision just > >> > > >>>>>> exposed this. Previously the vectorization is skipped because > >> > > >>>>>> of unsupported memory > >> > > >>>>>> operation. The outer loop vectorization unrolls the outer loop > >> > > >>>>>> into: > >> > > >>>>>> > >> > > >>>>>> for (b = 4; b > 0; b -= 4) > >> > > >>>>>> { > >> > > >>>>>> for (c = 0; c <= 6; c++) > >> > > >>>>>> a[c + 1][6] = a[c][5]; > >> > > >>>>>> for (c = 0; c <= 6; c++) > >> > > >>>>>> a[c + 1][5] = a[c][4]; > >> > > >>>>>> for (c = 0; c <= 6; c++) > >> > > >>>>>> a[c + 1][4] = a[c][3]; > >> > > >>>>>> for (c = 0; c <= 6; c++) > >> > > >>>>>> a[c + 1][3] = a[c][2]; > >> > > >>>>>> } > >> > > >>>>>> Then four inner loops are fused into: > >> > > >>>>>> for (b = 4; b > 0; b -= 4) > >> > > >>>>>> { > >> > > >>>>>> for (c = 0; c <= 6; c++) > >> > > >>>>>> { > >> > > >>>>>> a[c + 1][6] = a[c][5]; // S1 > >> > > >>>>>> a[c + 1][5] = a[c][4]; // S2 > >> > > >>>>>> a[c + 1][4] = a[c][3]; > >> > > >>>>>> a[c + 1][3] = a[c][2]; > >> > > >>>>>> } > >> > > >>>>>> } > >> > > >>>>> > >> > > >>>>> Note that they are not really "fused" but they are interleaved. > >> > > >>>>> With > >> > > >>>>> GIMPLE in mind > >> > > >>>>> that makes a difference, you should get the equivalent of > >> > > >>>>> > >> > > >>>>> for (c = 0; c <= 6; c++) > >> > > >>>>> { > >> > > >>>>> tem1 = a[c][5]; > >> > > >>>>> tem2 = a[c][4]; > >> > > >>>>> tem3 = a[c][3]; > >> > > >>>>> tem4 = a[c][2]; > >> > > >>>>> a[c+1][6] = tem1; > >> > > >>>>> a[c +1][5] = tem2; > >> > > >>>>> a[c+1][4] = tem3; > >> > > >>>>> a[c+1][3] = tem4; > >> > > >>>>> } > >> > > >>>> Yeah, I will double check if this abstract breaks the patch and > >> > > >>>> how. 
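For concreteness, here is a small self-contained C sketch (hand-written, not compiler output, and only an approximation of what the vectorizer would emit) that contrasts the original execution order of the PR 81740 nest with the interleaved loads-before-stores order described above, assuming an outer-loop VF of 4.  Running it aborts, which is exactly the reordering the dependence analysis has to reject.

int a[8][10] = { [2][5] = 4 };
int a2[8][10] = { [2][5] = 4 };

int
main ()
{
  int c;

  /* Original order: the b = 4 inner loop finishes before the b = 3 loop
     starts, so column 5 is only ever read with its initial values.  */
  for (int b = 4; b >= 0; b--)
    for (c = 0; c <= 6; c++)
      a[c + 1][b + 2] = a[c][b + 1];

  /* Interleaved order for the outer iterations b = 4..1 (VF = 4): all four
     loads of an inner iteration happen before its four stores, matching the
     tem1..tem4 form above; the leftover iteration b = 0 follows.  */
  for (c = 0; c <= 6; c++)
    {
      int t1 = a2[c][5], t2 = a2[c][4], t3 = a2[c][3], t4 = a2[c][2];
      a2[c + 1][6] = t1;
      a2[c + 1][5] = t2;
      a2[c + 1][4] = t3;
      a2[c + 1][3] = t4;
    }
  for (c = 0; c <= 6; c++)
    a2[c + 1][2] = a2[c][1];

  /* The two orders disagree (a[3][6] is 4, a2[3][6] is 0), so this aborts.  */
  for (int i = 0; i < 8; i++)
    for (int d = 0; d < 10; d++)
      if (a[i][d] != a2[i][d])
        __builtin_abort ();
  return 0;
}

The difference comes from element [2][5]: at c = 1 the interleaved loop overwrites it before the c = 2 iteration reads it, whereas the original nest finishes all reads of column 5 (the b = 4 loop) before any writes to it (the b = 3 loop).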
> >> > > >>> Hmm, I think this doesn't break it, well at least for part of the > >> > > >>> analysis, because it is loop carried (backward) dependence goes > >> > > >>> wrong, > >> > > >>> interleaving or not with the same iteration doesn't matter here. > >> > > >> > >> > > >> I think the idea is that forward dependences are always fine > >> > > >> (negative distance) > >> > > >> to vectorize. But with backward dependences we have to adhere to > >> > > >> max_vf. > >> > > >> > >> > > >> It looks like for outer loop vectorization we only look at the > >> > > >> distances in the > >> > > >> outer loop but never at inner ones. But here the same applies but > >> > > >> isn't that > >> > > >> independend on the distances with respect to the outer loop? > >> > > >> > >> > > >> But maybe I'm misunderstanding how "distances" work here. > >> > > > Hmm, I am not sure I understand "distance" correctly. With > >> > > > description as in book like "Optimizing compilers for Modern > >> > > > Architectures", distance is "# of iteration of sink ref - # of > >> > > > iteration of source ref". Given below example: > >> > > > for (i = 0; i < N; ++i) > >> > > > { > >> > > > x = arr[idx_1]; // S1 > >> > > > arr[idx_2] = x; // S2 > >> > > > } > >> > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. > >> > > > Also > >> > > > this is forward dependence. For example, idx_1 is i + 1 and idx_2 is > >> > > > i; > >> > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. > >> > > > Also > >> > > > this is backward dependence. For example idx_1 is i and idx_2 is i + > >> > > > 1; > >> > > > > >> > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing > >> > > > classic distance, we could result in negative distance in case of > >> > > > backward dependence. When this happens at dependence carried level, > >> > > > we manually reverse it. When this happens at inner level loop, we > >> > > > simply keep the negative distance. And it's meaningless to talk > >> > > > about > >> > > > forward/backward given dependence is carried at outer level loop. > >> > > > > >> > > > Outer loop vectorization is interesting. The problematic case has > >> > > > backward dependence carried by outer loop. Because we don't check > >> > > > dist vs. max_vf for such dep, the unrolled references could have > >> > > > outer > >> > > > loop index equal to each other, as in a[c][5] and a[c+1][5]. So it's > >> > > > like we have outer loop index resolved as equal. Now it has > >> > > > dependence only if c == c' + 1. I think previous comment on fusion > >> > > > still hold for this and we now need to check backward dependence > >> > > > between the two reference at inner level loop. I guess it's a bit > >> > > > trick to model dependence between a[c][5] and a[c+1][5] using the > >> > > > original references and dependence. I think it's okay to do that > >> > > > because distance of a[c][5] and a[c+1][5] is what we computed > >> > > > previously for the original references at inner level loop. > >> > > > > >> > > > Not sure if I have missed something important. > >> > > > >> > > Not sure either. unroll-and-jam does > >> > > > >> > > FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) > >> > > { > >> > > /* A distance (a,b) is at worst transformed into (a/N,b) by > >> > > the > >> > > unrolling (factor N), so the transformation is valid if > >> > > a >= N, or b > 0, or b is zero and a > 0. 
Otherwise the > >> > > unroll > >> > > factor needs to be limited so that the first condition > >> > > holds. > >> > > That may limit the factor down to zero in the worst case. > >> > > */ > >> > > int dist = dist_v[0]; > >> > > if (dist < 0) > >> > > gcc_unreachable (); > >> > > else if ((unsigned)dist >= *unroll) > >> > > ; > >> > > else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS > >> > > (ddr) - 1) > >> > > || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS > >> > > (ddr) - 1) > >> > > && dist > 0)) > >> > > ; > >> > > else > >> > > *unroll = dist; > >> > > > >> > > where *unroll is similar to *max_vf I think. dist_v[0] is the > >> > > innermost loop. > >> > > > >> > > The vectorizer does way more complicated things and only looks at the > >> > > distance with respect to the outer loop as far as I can see which can > >> > > be negative. > >> > > > >> > > Not sure if fusion and vectorizer "interleaving" makes a difference > >> > > here. > >> > > I think the idea was that when interleaving stmt-by-stmt then forward > >> > > dependences would be preserved and thus we don't need to check the > >> > > inner loop dependences. speaking with "forward vs. backward" > >> > > dependences > >> > > again, not distances... > >> > > > >> > > This also means that unroll-and-jam could be enhanced to "interleave" > >> > > stmts and thus cover more cases? > >> > > > >> > > Again, I hope Micha can have a look here... > >> > > >> > Coming back to this... the vectorizer doesn't look at the inner loop > >> > distances > >> > at all at the moment. Bins orginal patch adds that for the case where > >> > the > >> > outer loop evolution results in a negative dependence distance. But what > >> > if the outer loop evolution results in a zero distance or a positive > >> > distance > >> > within the bounds of a sensible max_vf? So I think we need to "simply" > >> > look at all distance vector components, not just that of the outer loop > >> > and perform all regular checks. That also allows adjustment of max_vf > >> > for inner loop dependences in case vectorization with smaller vectors is > >> > still possible. > >> > > >> > Doing that by making loop_depth iterate > >> > fixes the testcase (as expected) but also regresses > >> > vect-outer-simd-[123].c > >> > if we make the code look at ->safelen of the inner loop when processing > >> > inner loop distances (because safelen is 0 there). Just using ->safelen > >> > from the outer loop doesn't regress anything it seems. > >> > > >> > So I am going to test the following attached. I have added the > >> > testcase also with reversed inner loop to verify we'd apply outer loop > >> > vectorization to that one. > >> > > >> > Any comments? > >> I kind of think that we are checking for different things in this way. > >> IIUC, dependence checks in outer loop vectorizer could be categorized > >> into three parts. The regular vect related check on outer loop; > >> dependence check to fuse inner loop iterations; and dependence check > >> to shuffle scalar statement/reference together. To pass the fusion > >> check, we only need to make sure no backward dependence is generated > >> in the result. So for case of zero/positive distance in outer loop > >> evolution, it won't result in backward dependence? 
For example, if we > >> adjust the original loop like: > >> > >> for (b = 4; b >= 0; b--) > >> for (c = 0; c <= 6; c++) > >> - a[c + 1][b + 2] = a[c][b + 1]; > >> + a[c + 1][b + 1] = a[c][b + 2]; > >> > >> The fusion result would be > >> for (b = 4; b > 0; b -= 4) > >> { > >> for (c = 0; c <= 6; c++) > >> { > >> a[c + 1][5] = a[c][6]; // S1 > >> a[c + 1][4] = a[c][5]; // S2 > >> a[c + 1][3] = a[c][4]; > >> a[c + 1][2] = a[c][3]; > >> } > >> } > >> Though there is dependence between s1/s2, it's forward dependence now. > > > > Hmm, OK. But this doesn't vectorize with or without the patch, we claim > > > > t.c:9:3: note: dependence distance = 1. > > t.c:11:29: missed: not vectorized, possible dependence between > > data-refs a[c.3_15][_48] and a[_3][_50] > > t.c:9:3: missed: bad data dependence. > > t.c:9:3: missed: couldn't vectorize loop > > > >> Even we need to check the negative distance for inner loop, using the > >> regular checks for fusion part for inner loop(s) is a bit strange. > >> Using outer loop's safelen here is also unclear. > >> > >> I am not sure where we do the shuffle related check, is it (dist == 0) > >> case? > >> Is that the reason that the patch employs the whole regular checks? > > > > I think the suffling needs to look at the outer loop distance and it is the > > == 0 and the forward dependence case where we check/constrain > > against the vectorization factor? > > > > I agree that formulating the outer loop dependence testing in a way to > > check the unroll-and-jam condition and the shuffling part separately > > would be best. I was mainly worried that your patch only hooks into > > the negative "outer" distance case and not the zero or positive > > distance one. For example for unroll-and-jam we constrain the > > maximum unroll factor. I wonder if we should simply call into > > its adjust_unroll_factor from the vectorizer or copy&paste it > > to the vectorizer. I'll note that with known dependence distances > > it seems to never compute failure... at least it computes the > > correct maximum vectorization factor for me. > > > > I also wonder about dependences for DRs in the outer loop > > which we'd even more heavily re-order. > > > > Anyways, we should somehow fix this. > > > > Attached is a patch variant cut&pasting the unroll-and-jam > > testing. > > > > Does this look better? > > > > Thanks, > > Richard. > > > >> Thanks, > >> bin > >> > > >> > Bootstrap & regtest running on x86_64-unknown-linux-gnu. > >> > > >> > Richard. > >> > > >> > 2019-03-21 Richard Biener <rguent...@suse.de> > >> > > >> > PR tree-optimization/81740 > >> > * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): > >> > Perform distance vector related dependence checks for all > >> > loops. > >> > > >> > * gcc.dg/vect/pr81740-1.c: New testcase. > >> > * gcc.dg/vect/pr81740-2.c: Likewise. > > > > 2019-03-25 Richard Biener <rguent...@suse.de> > > > > PR tree-optimization/81740 > > * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): > > Add explicit unroll-and-jam dependence testing. > > > > * gcc.dg/vect/pr81740-1.c: New testcase. > > * gcc.dg/vect/pr81740-2.c: Likewise. 
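For reference, here is a standalone paraphrase of the unroll-and-jam distance test that the patch below reuses.  This is my reading of the comment quoted earlier, not the GCC internals; lexico_nonneg and all_zero are stand-ins for lambda_vector_lexico_pos and lambda_vector_zerop.

#include <stdbool.h>

/* First nonzero component positive; an all-zero vector also counts.  */
static bool
lexico_nonneg (const int *v, int n)
{
  for (int i = 0; i < n; i++)
    {
      if (v[i] < 0)
        return false;
      if (v[i] > 0)
        return true;
    }
  return true;
}

static bool
all_zero (const int *v, int n)
{
  for (int i = 0; i < n; i++)
    if (v[i] != 0)
      return false;
  return true;
}

/* DIST_V is a dependence distance vector, outermost loop first, with NB
   components.  A distance (a, b) is at worst turned into (a/N, b) by
   unrolling the outer loop by N and jamming the copies, so it stays valid
   if a >= N, or b > 0, or b == 0 and a > 0; otherwise clamp *UNROLL to a
   (possibly down to 0).  */
static void
limit_unroll (const int *dist_v, int nb, unsigned *unroll)
{
  int dist = dist_v[0];
  if (dist < 0)
    return;                      /* distances are normalised to be >= 0 */
  if ((unsigned) dist >= *unroll)
    return;                      /* the unrolled copies never span the dependence */
  if (lexico_nonneg (dist_v + 1, nb - 1)
      || (all_zero (dist_v + 1, nb - 1) && dist > 0))
    return;                      /* forward at the inner level: jamming keeps the order */
  *unroll = dist;                /* backward at the inner level: unroll at most by a */
}

If I read the dependence in the original test case correctly, the write a[c + 1][b + 2] / read a[c][b + 1] pair has the normalised distance vector (1, -1), so the factor gets clamped to 1, i.e. outer-loop vectorization is effectively rejected, which matches the dist <= 1 failure path in the patch below.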
> > > > Index: gcc/tree-vect-data-refs.c > > =================================================================== > > --- gcc/tree-vect-data-refs.c (revision 269908) > > +++ gcc/tree-vect-data-refs.c (working copy) > > @@ -411,6 +411,45 @@ vect_analyze_data_ref_dependence (struct > > } > > > > loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr)); > > + gcc_assert (loop_depth == 0); > > + > > + /* Perform unroll-and-jam test. */ > > + if (nested_in_vect_loop_p (loop, stmtinfo_a)) > > + { > > + FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) > > + { > > + /* A distance (a,b) is at worst transformed into (a/N,b) by the > > + unrolling (factor N), so the transformation is valid if > > + a >= N, or b > 0, or b is zero and a > 0. Otherwise the unroll > > + factor needs to be limited so that the first condition holds. > > + That may limit the factor down to zero in the worst case. */ > > I think it would be good to say that we're applying a two-stage > check here (unroll-and-jam, then statement-level interleaving). As it > stands, it sounds like proving (a>0 && b==0) || b>0 is enough to prove > that outer-loop vectorisation is valid for any VF, whereas that isn't > true for the interleaving part. That said... > > > + int dist = dist_v[0]; > > + if (dist < 0) > > + gcc_unreachable (); > > + else if ((unsigned)dist >= *max_vf) > > + ; > > + else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - > > 1) > > + || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - > > 1) > > + && dist > 0)) > > + ; > > + else if (dist <= 1) > > + return opt_result::failure_at (stmtinfo_a->stmt, > > + "not vectorized, possible > > dependence" > > + " between data-refs %T and %T\n", > > + DR_REF (dra), DR_REF (drb)); > > + else > > + { > > + /* Dependence distance does not create dependence, as far as > > + vectorization is concerned, in this case. */ > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_NOTE, vect_location, > > + "dependence between data-refs %T and %T " > > + "constrains vectorization factor to %d\n", > > + DR_REF (dra), DR_REF (drb), dist); > > + *max_vf = dist; > > + } > > + } > > + } > > ...I guess this is looping back to the original discussion, sorry, but I'm > not sure taking it as a two-stage check really helps. We're going from: > > A: > for i in [0:n]: > for j in [0:n]: > A(i,j) > B(i,j) > > to: > > B: > for i in [0:n:2]: > for j in [0:n]: > A(i,j) > B(i,j) > for j in [0:n]: > A(i+1,j) > B(i+1,j) > > to: > > C: > for i in [0:n:2]: > for j in [0:n]: > A(i,j) > B(i,j) > A(i+1,j) > B(i+1,j) > > to: > > D: > for i in [0:n:2]: > for j in [0:n]: > A(i,j) > A(i+1,j) > B(i,j) > B(i+1,j) > > but since outer loop vectorisation doesn't change the order of > statements for each "i", IMO it's clearer to go directly from B to D > by treating the inner loop as though we were completely unrolling it. > C doesn't seem like a useful midway point. > > I think the only cases the patch is handling on top of existing checks are: > > (a) a<0 && b>0, normalised to a>0 && b<0 with a reverse flag. > (b) a==0 && b==0, which we now reject earlier > > I don't think (b) makes any difference in practice because > (i) we already reject accesses with a zero step and (ii) > without the support for runtime aliasing checks for outer loops, > we're not going to be able to add any extra disambiguation for > things like: > > d[i][j] = ...; > ... = d[i][j]; > > relative to other loop accesses, so there won't be any cases we can > handle here that wouldn't have been propagated earlier. 
> > So I think we're really left with (a) being the useful case, > which is also the one added by Bin's original patch. > > Based on the "complete unrolling" view, if we number statements as > (i, n), where i is the outer loop iteration and n is a statement number > in the (completely unrolled) loop body, then the original scalar code > executes in lexicographical order while for the vector loop: > > (1) (i,n) executes before (i+ix,n+nx) for all ix>=0, nx>=1, regardless of VF > (2) (i,n) executes before (i+ix,n-nx) for all ix>=VF, nx>=0 > (well, nx unrestricted, but only nx>=0 is useful given (1)) > > So for any kind of dependence between (i,n) and (i+ix,n-nx), ix>=1, nx>=0 > we need to restrict VF to ix so that (2) ensures the right order. > This means that the unnormalised distances of interest are: > > - (ix, -nx), ix>=1, nx>=0 > - (-ix, nx), ix>=1, nx>=0 > > But the second gets normalised to the first, which is actually useful > in this case :-). > > In terms of the existing code, I think that means we want to change > the handling of nested statements (only) to: > > - ignore DDR_REVERSED_P (ddr) > - restrict the main dist > 0 case to when the inner distance is <= 0. > > This should have the side effect of allowing outer-loop vectorisation for: > > void __attribute__ ((noipa)) > f (int a[][N], int b[restrict]) > { > for (int i = N - 1; i-- > 0; ) > for (int j = 0; j < N - 1; ++j) > a[j + 1][i] = a[j][i + 1] + b[i]; > } > > At the moment we reject this, but AFAICT it should be OK. > (We do allow it for s/i + 1/i/, since then the outer distance is 0.)
Can you file an enhancement request so we don't forget? > I think we can also skip the existing dist==0 code for nested statements > since any case that really matters (such as zero steps) should also have > a second dependence distance (1, 0) that we'd handle as above. Doing that > is probably something for stage 1 though. > > Hope at least some of that was vaguely right. ;-) OK, so there seems to be consensus that Bin's patch catches all cases, so let's go with it. I'm going to re-test & apply it. Richard. > > Thanks, > Richard
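As a footnote, a compact restatement (my paraphrase, not GCC code) of the ordering argument in Richard's message above, with statements numbered (i, n): i is the outer-loop iteration and n the position in the completely unrolled inner body.

#include <stdbool.h>

/* Sufficient conditions for (i, n) to execute before (i + ix, n + dn) in
   the outer-loop-vectorized code:
     (1) ix >= 0 and dn >= 1   -- statement order is preserved, any VF;
     (2) ix >= vf              -- a later vector iteration, any dn.  */
static bool
known_executes_before (int ix, int dn, int vf)
{
  return (ix >= 0 && dn >= 1) || ix >= vf;
}

/* Hence a dependence from (i, n) to (i + ix, n - nx) with ix >= 1 and
   nx >= 0 is only respected when ix >= vf, i.e. it caps the VF at ix;
   forward dependences (nx < 0) never constrain it.  */
static int
cap_vf (int ix, int vf)
{
  return ix < vf ? ix : vf;
}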