Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization

Richard Biener Mon, 25 Mar 2019 06:26:48 -0700

On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng <amker.ch...@gmail.com> wrote:
>
> On Thu, Mar 21, 2019 at 8:24 PM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
> > > > <richard.guent...@gmail.com> wrote:
> > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.ch...@gmail.com> 
> > > >> wrote:
> > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.ch...@gmail.com> 
> > > >>> wrote:
> > > >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
> > > >>>> <richard.guent...@gmail.com> wrote:
> > > >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <bin.ch...@arm.com> 
> > > >>>>> wrote:
> > > >>>>>> Hi,
> > > >>>>>> As explained in the PR, given below test case:
> > > >>>>>> int a[8][10] = { [2][5] = 4 }, c;
> > > >>>>>>
> > > >>>>>> int
> > > >>>>>> main ()
> > > >>>>>> {
> > > >>>>>>   short b;
> > > >>>>>>   int i, d;
> > > >>>>>>   for (b = 4; b >= 0; b--)
> > > >>>>>>     for (c = 0; c <= 6; c++)
> > > >>>>>>       a[c + 1][b + 2] = a[c][b + 1];
> > > >>>>>>   for (i = 0; i < 8; i++)
> > > >>>>>>     for (d = 0; d < 10; d++)
> > > >>>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
> > > >>>>>>         __builtin_abort ();
> > > >>>>>>   return 0;
> > > >>>>>>
> > > >>>>>> the loop nest is illegal for vectorization without reversing inner 
> > > >>>>>> loop.  The issue
> > > >>>>>> is in data dependence checking of vectorizer, I believe the 
> > > >>>>>> mentioned revision just
> > > >>>>>> exposed this.  Previously the vectorization is skipped because of 
> > > >>>>>> unsupported memory
> > > >>>>>> operation.  The outer loop vectorization unrolls the outer loop 
> > > >>>>>> into:
> > > >>>>>>
> > > >>>>>>   for (b = 4; b > 0; b -= 4)
> > > >>>>>>   {
> > > >>>>>>     for (c = 0; c <= 6; c++)
> > > >>>>>>       a[c + 1][6] = a[c][5];
> > > >>>>>>     for (c = 0; c <= 6; c++)
> > > >>>>>>       a[c + 1][5] = a[c][4];
> > > >>>>>>     for (c = 0; c <= 6; c++)
> > > >>>>>>       a[c + 1][4] = a[c][3];
> > > >>>>>>     for (c = 0; c <= 6; c++)
> > > >>>>>>       a[c + 1][3] = a[c][2];
> > > >>>>>>   }
> > > >>>>>> Then four inner loops are fused into:
> > > >>>>>>   for (b = 4; b > 0; b -= 4)
> > > >>>>>>   {
> > > >>>>>>     for (c = 0; c <= 6; c++)
> > > >>>>>>     {
> > > >>>>>>       a[c + 1][6] = a[c][5];  // S1
> > > >>>>>>       a[c + 1][5] = a[c][4];  // S2
> > > >>>>>>       a[c + 1][4] = a[c][3];
> > > >>>>>>       a[c + 1][3] = a[c][2];
> > > >>>>>>     }
> > > >>>>>>   }
> > > >>>>>
> > > >>>>> Note that they are not really "fused" but they are interleaved.  
> > > >>>>> With
> > > >>>>> GIMPLE in mind
> > > >>>>> that makes a difference, you should get the equivalent of
> > > >>>>>
> > > >>>>>    for (c = 0; c <= 6; c++)
> > > >>>>>      {
> > > >>>>>        tem1 = a[c][5];
> > > >>>>>        tem2 = a[c][4];
> > > >>>>>        tem3 = a[c][3];
> > > >>>>>        tem4 = a[c][2];
> > > >>>>>        a[c+1][6] = tem1;
> > > >>>>>        a[c +1][5] = tem2;
> > > >>>>>         a[c+1][4] = tem3;
> > > >>>>>         a[c+1][3] = tem4;
> > > >>>>>      }
> > > >>>> Yeah, I will double check if this abstract breaks the patch and how.
> > > >>> Hmm, I think this doesn't break it, well at least for part of the
> > > >>> analysis, because it is loop carried (backward) dependence goes wrong,
> > > >>> interleaving or not with the same iteration doesn't matter here.
> > > >>
> > > >> I think the idea is that forward dependences are always fine (negative 
> > > >> distance)
> > > >> to vectorize.  But with backward dependences we have to adhere to 
> > > >> max_vf.
> > > >>
> > > >> It looks like for outer loop vectorization we only look at the 
> > > >> distances in the
> > > >> outer loop but never at inner ones.  But here the same applies but 
> > > >> isn't that
> > > >> independend on the distances with respect to the outer loop?
> > > >>
> > > >> But maybe I'm misunderstanding how "distances" work here.
> > > > Hmm, I am not sure I understand "distance" correctly.  With
> > > > description as in book like "Optimizing compilers for Modern
> > > > Architectures", distance is "# of iteration of sink ref - # of
> > > > iteration of source ref".  Given below example:
> > > >   for (i = 0; i < N; ++i)
> > > >     {
> > > >       x = arr[idx_1];  // S1
> > > >       arr[idx_2] = x;  // S2
> > > >     }
> > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0.  Also
> > > > this is forward dependence.  For example, idx_1 is i + 1 and idx_2 is
> > > > i;
> > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0.  Also
> > > > this is backward dependence.  For example idx_1 is i and idx_2 is i +
> > > > 1;
> > > >
> > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing
> > > > classic distance, we could result in negative distance in case of
> > > > backward dependence.  When this happens at dependence carried level,
> > > > we manually reverse it.  When this happens at inner level loop, we
> > > > simply keep the negative distance.  And it's meaningless to talk about
> > > > forward/backward given dependence is carried at outer level loop.
> > > >
> > > > Outer loop vectorization is interesting.  The problematic case has
> > > > backward dependence carried by outer loop.  Because we don't check
> > > > dist vs. max_vf for such dep, the unrolled references could have outer
> > > > loop index equal to each other, as in a[c][5] and a[c+1][5].  So it's
> > > > like we have outer loop index resolved as equal.  Now it has
> > > > dependence only if c == c' + 1.  I think previous comment on fusion
> > > > still hold for this and we now need to check backward dependence
> > > > between the two reference at inner level loop.  I guess it's a bit
> > > > trick to model dependence between a[c][5] and a[c+1][5] using the
> > > > original references and dependence.  I think it's okay to do that
> > > > because distance of a[c][5] and a[c+1][5] is what we computed
> > > > previously for the original references at inner level loop.
> > > >
> > > > Not sure if I have missed something important.
> > >
> > > Not sure either.  unroll-and-jam does
> > >
> > >       FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
> > >         {
> > >           /* A distance (a,b) is at worst transformed into (a/N,b) by the
> > >              unrolling (factor N), so the transformation is valid if
> > >              a >= N, or b > 0, or b is zero and a > 0.  Otherwise the 
> > > unroll
> > >              factor needs to be limited so that the first condition holds.
> > >              That may limit the factor down to zero in the worst case.  */
> > >           int dist = dist_v[0];
> > >           if (dist < 0)
> > >             gcc_unreachable ();
> > >           else if ((unsigned)dist >= *unroll)
> > >             ;
> > >           else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS 
> > > (ddr) - 1)
> > >                    || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS 
> > > (ddr) - 1)
> > >                        && dist > 0))
> > >             ;
> > >           else
> > >             *unroll = dist;
> > >
> > > where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost 
> > > loop.
> > >
> > > The vectorizer does way more complicated things and only looks at the
> > > distance with respect to the outer loop as far as I can see which can
> > > be negative.
> > >
> > > Not sure if fusion and vectorizer "interleaving" makes a difference here.
> > > I think the idea was that when interleaving stmt-by-stmt then forward
> > > dependences would be preserved and thus we don't need to check the
> > > inner loop dependences.  speaking with "forward vs. backward" dependences
> > > again, not distances...
> > >
> > > This also means that unroll-and-jam could be enhanced to "interleave"
> > > stmts and thus cover more cases?
> > >
> > > Again, I hope Micha can have a look here...
> >
> > Coming back to this...  the vectorizer doesn't look at the inner loop 
> > distances
> > at all at the moment.  Bins orginal patch adds that for the case where the
> > outer loop evolution results in a negative dependence distance.  But what
> > if the outer loop evolution results in a zero distance or a positive 
> > distance
> > within the bounds of a sensible max_vf?  So I think we need to "simply"
> > look at all distance vector components, not just that of the outer loop
> > and perform all regular checks.  That also allows adjustment of max_vf
> > for inner loop dependences in case vectorization with smaller vectors is
> > still possible.
> >
> > Doing that by making loop_depth iterate
> > fixes the testcase (as expected) but also regresses vect-outer-simd-[123].c
> > if we make the code look at ->safelen of the inner loop when processing
> > inner loop distances (because safelen is 0 there).  Just using ->safelen
> > from the outer loop doesn't regress anything it seems.
> >
> > So I am going to test the following attached.  I have added the
> > testcase also with reversed inner loop to verify we'd apply outer loop
> > vectorization to that one.
> >
> > Any comments?
> I kind of think that we are checking for different things in this way.
> IIUC, dependence checks in outer loop vectorizer could be categorized
> into three parts.  The regular vect related check on outer loop;
> dependence check to fuse inner loop iterations; and dependence check
> to shuffle scalar statement/reference together.  To pass the fusion
> check, we only need to make sure no backward dependence is generated
> in the result.  So for case of zero/positive distance in outer loop
> evolution, it won't result in backward dependence?  For example, if we
> adjust the original loop like:
>
>    for (b = 4; b >= 0; b--)
>      for (c = 0; c <= 6; c++)
> -       a[c + 1][b + 2] = a[c][b + 1];
> +      a[c + 1][b + 1] = a[c][b + 2];
>
> The fusion result would be
>    for (b = 4; b > 0; b -= 4)
>    {
>      for (c = 0; c <= 6; c++)
>      {
>        a[c + 1][5] = a[c][6];  // S1
>        a[c + 1][4] = a[c][5];  // S2
>        a[c + 1][3] = a[c][4];
>        a[c + 1][2] = a[c][3];
>      }
>    }
> Though there is dependence between s1/s2, it's forward dependence now.


Hmm, OK.  But this doesn't vectorize with or without the patch, we claim

t.c:9:3: note:   dependence distance  = 1.
t.c:11:29: missed:   not vectorized, possible dependence between
data-refs a[c.3_15][_48] and a[_3][_50]
t.c:9:3: missed:  bad data dependence.
t.c:9:3: missed: couldn't vectorize loop

> Even we need to check the negative distance for inner loop, using the
> regular checks for fusion part for inner loop(s) is a bit strange.
> Using outer loop's safelen here is also unclear.
>
> I am not sure where we do the shuffle related check, is it (dist == 0) case?
> Is that the reason that the patch employs the whole regular checks?

I think the suffling needs to look at the outer loop distance and it is the
== 0 and the forward dependence case where we check/constrain
against the vectorization factor?

I agree that formulating the outer loop dependence testing in a way to
check the unroll-and-jam condition and the shuffling part separately
would be best.  I was mainly worried that your patch only hooks into
the negative "outer" distance case and not the zero or positive
distance one.  For example for unroll-and-jam we constrain the
maximum unroll factor.  I wonder if we should simply call into
its adjust_unroll_factor from the vectorizer or copy&paste it
to the vectorizer.  I'll note that with known dependence distances
it seems to never compute failure...  at least it computes the
correct maximum vectorization factor for me.

I also wonder about dependences for DRs in the outer loop
which we'd even more heavily re-order.

Anyways, we should somehow fix this.

Attached is a patch variant cut&pasting the unroll-and-jam
testing.

Does this look better?

Thanks,
Richard.

> Thanks,
> bin
> >
> > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> >
> > Richard.
> >
> > 2019-03-21  Richard Biener  <rguent...@suse.de>
> >
> >         PR tree-optimization/81740
> >         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence):
> >         Perform distance vector related dependence checks for all
> >         loops.
> >
> >         * gcc.dg/vect/pr81740-1.c: New testcase.
> >         * gcc.dg/vect/pr81740-2.c: Likewise.

fix-pr81740-2
Description: Binary data

Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization

Reply via email to