On Tue, Mar 26, 2019 at 1:56 AM Richard Sandiford <richard.sandif...@arm.com> wrote: > > Richard Biener <richard.guent...@gmail.com> writes: > > On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng <amker.ch...@gmail.com> wrote: > >> > >> On Thu, Mar 21, 2019 at 8:24 PM Richard Biener > >> <richard.guent...@gmail.com> wrote: > >> > > >> > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener > >> > <richard.guent...@gmail.com> wrote: > >> > > > >> > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.ch...@gmail.com> > >> > > wrote: > >> > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener > >> > > > <richard.guent...@gmail.com> wrote: > >> > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.ch...@gmail.com> > >> > > >> wrote: > >> > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng > >> > > >>> <amker.ch...@gmail.com> wrote: > >> > > >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener > >> > > >>>> <richard.guent...@gmail.com> wrote: > >> > > >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <bin.ch...@arm.com> > >> > > >>>>> wrote: > >> > > >>>>>> Hi, > >> > > >>>>>> As explained in the PR, given below test case: > >> > > >>>>>> int a[8][10] = { [2][5] = 4 }, c; > >> > > >>>>>> > >> > > >>>>>> int > >> > > >>>>>> main () > >> > > >>>>>> { > >> > > >>>>>> short b; > >> > > >>>>>> int i, d; > >> > > >>>>>> for (b = 4; b >= 0; b--) > >> > > >>>>>> for (c = 0; c <= 6; c++) > >> > > >>>>>> a[c + 1][b + 2] = a[c][b + 1]; > >> > > >>>>>> for (i = 0; i < 8; i++) > >> > > >>>>>> for (d = 0; d < 10; d++) > >> > > >>>>>> if (a[i][d] != (i == 3 && d == 6) * 4) > >> > > >>>>>> __builtin_abort (); > >> > > >>>>>> return 0; > >> > > >>>>>> > >> > > >>>>>> the loop nest is illegal for vectorization without reversing > >> > > >>>>>> inner loop. The issue > >> > > >>>>>> is in data dependence checking of vectorizer, I believe the > >> > > >>>>>> mentioned revision just > >> > > >>>>>> exposed this. Previously the vectorization is skipped because > >> > > >>>>>> of unsupported memory > >> > > >>>>>> operation. The outer loop vectorization unrolls the outer loop > >> > > >>>>>> into: > >> > > >>>>>> > >> > > >>>>>> for (b = 4; b > 0; b -= 4) > >> > > >>>>>> { > >> > > >>>>>> for (c = 0; c <= 6; c++) > >> > > >>>>>> a[c + 1][6] = a[c][5]; > >> > > >>>>>> for (c = 0; c <= 6; c++) > >> > > >>>>>> a[c + 1][5] = a[c][4]; > >> > > >>>>>> for (c = 0; c <= 6; c++) > >> > > >>>>>> a[c + 1][4] = a[c][3]; > >> > > >>>>>> for (c = 0; c <= 6; c++) > >> > > >>>>>> a[c + 1][3] = a[c][2]; > >> > > >>>>>> } > >> > > >>>>>> Then four inner loops are fused into: > >> > > >>>>>> for (b = 4; b > 0; b -= 4) > >> > > >>>>>> { > >> > > >>>>>> for (c = 0; c <= 6; c++) > >> > > >>>>>> { > >> > > >>>>>> a[c + 1][6] = a[c][5]; // S1 > >> > > >>>>>> a[c + 1][5] = a[c][4]; // S2 > >> > > >>>>>> a[c + 1][4] = a[c][3]; > >> > > >>>>>> a[c + 1][3] = a[c][2]; > >> > > >>>>>> } > >> > > >>>>>> } > >> > > >>>>> > >> > > >>>>> Note that they are not really "fused" but they are interleaved. > >> > > >>>>> With > >> > > >>>>> GIMPLE in mind > >> > > >>>>> that makes a difference, you should get the equivalent of > >> > > >>>>> > >> > > >>>>> for (c = 0; c <= 6; c++) > >> > > >>>>> { > >> > > >>>>> tem1 = a[c][5]; > >> > > >>>>> tem2 = a[c][4]; > >> > > >>>>> tem3 = a[c][3]; > >> > > >>>>> tem4 = a[c][2]; > >> > > >>>>> a[c+1][6] = tem1; > >> > > >>>>> a[c +1][5] = tem2; > >> > > >>>>> a[c+1][4] = tem3; > >> > > >>>>> a[c+1][3] = tem4; > >> > > >>>>> } > >> > > >>>> Yeah, I will double check if this abstract breaks the patch and > >> > > >>>> how. 
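For concreteness, here is a small self-contained C sketch (hand-written, not compiler output, and only an approximation of what the vectorizer would emit) that contrasts the original execution order of the PR 81740 nest with the interleaved loads-before-stores order described above, assuming an outer-loop VF of 4.  Running it aborts, which is exactly the reordering the dependence analysis has to reject.

int a[8][10] = { [2][5] = 4 };
int a2[8][10] = { [2][5] = 4 };

int
main ()
{
  int c;

  /* Original order: the b = 4 inner loop finishes before the b = 3 loop
     starts, so column 5 is only ever read with its initial values.  */
  for (int b = 4; b >= 0; b--)
    for (c = 0; c <= 6; c++)
      a[c + 1][b + 2] = a[c][b + 1];

  /* Interleaved order for the outer iterations b = 4..1 (VF = 4): all four
     loads of an inner iteration happen before its four stores, matching the
     tem1..tem4 form above; the leftover iteration b = 0 follows.  */
  for (c = 0; c <= 6; c++)
    {
      int t1 = a2[c][5], t2 = a2[c][4], t3 = a2[c][3], t4 = a2[c][2];
      a2[c + 1][6] = t1;
      a2[c + 1][5] = t2;
      a2[c + 1][4] = t3;
      a2[c + 1][3] = t4;
    }
  for (c = 0; c <= 6; c++)
    a2[c + 1][2] = a2[c][1];

  /* The two orders disagree (a[3][6] is 4, a2[3][6] is 0), so this aborts.  */
  for (int i = 0; i < 8; i++)
    for (int d = 0; d < 10; d++)
      if (a[i][d] != a2[i][d])
        __builtin_abort ();
  return 0;
}

The difference comes from element [2][5]: at c = 1 the interleaved loop overwrites it before the c = 2 iteration reads it, whereas the original nest finishes all reads of column 5 (the b = 4 loop) before any writes to it (the b = 3 loop).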
> >> > > >>> Hmm, I think this doesn't break it, well at least for part of the > >> > > >>> analysis, because it is loop carried (backward) dependence goes > >> > > >>> wrong, > >> > > >>> interleaving or not with the same iteration doesn't matter here. > >> > > >> > >> > > >> I think the idea is that forward dependences are always fine > >> > > >> (negative distance) > >> > > >> to vectorize. But with backward dependences we have to adhere to > >> > > >> max_vf. > >> > > >> > >> > > >> It looks like for outer loop vectorization we only look at the > >> > > >> distances in the > >> > > >> outer loop but never at inner ones. But here the same applies but > >> > > >> isn't that > >> > > >> independend on the distances with respect to the outer loop? > >> > > >> > >> > > >> But maybe I'm misunderstanding how "distances" work here. > >> > > > Hmm, I am not sure I understand "distance" correctly. With > >> > > > description as in book like "Optimizing compilers for Modern > >> > > > Architectures", distance is "# of iteration of sink ref - # of > >> > > > iteration of source ref". Given below example: > >> > > > for (i = 0; i < N; ++i) > >> > > > { > >> > > > x = arr[idx_1]; // S1 > >> > > > arr[idx_2] = x; // S2 > >> > > > } > >> > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. > >> > > > Also > >> > > > this is forward dependence. For example, idx_1 is i + 1 and idx_2 is > >> > > > i; > >> > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. > >> > > > Also > >> > > > this is backward dependence. For example idx_1 is i and idx_2 is i + > >> > > > 1; > >> > > > > >> > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing > >> > > > classic distance, we could result in negative distance in case of > >> > > > backward dependence. When this happens at dependence carried level, > >> > > > we manually reverse it. When this happens at inner level loop, we > >> > > > simply keep the negative distance. And it's meaningless to talk > >> > > > about > >> > > > forward/backward given dependence is carried at outer level loop. > >> > > > > >> > > > Outer loop vectorization is interesting. The problematic case has > >> > > > backward dependence carried by outer loop. Because we don't check > >> > > > dist vs. max_vf for such dep, the unrolled references could have > >> > > > outer > >> > > > loop index equal to each other, as in a[c][5] and a[c+1][5]. So it's > >> > > > like we have outer loop index resolved as equal. Now it has > >> > > > dependence only if c == c' + 1. I think previous comment on fusion > >> > > > still hold for this and we now need to check backward dependence > >> > > > between the two reference at inner level loop. I guess it's a bit > >> > > > trick to model dependence between a[c][5] and a[c+1][5] using the > >> > > > original references and dependence. I think it's okay to do that > >> > > > because distance of a[c][5] and a[c+1][5] is what we computed > >> > > > previously for the original references at inner level loop. > >> > > > > >> > > > Not sure if I have missed something important. > >> > > > >> > > Not sure either. unroll-and-jam does > >> > > > >> > > FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) > >> > > { > >> > > /* A distance (a,b) is at worst transformed into (a/N,b) by > >> > > the > >> > > unrolling (factor N), so the transformation is valid if > >> > > a >= N, or b > 0, or b is zero and a > 0. 
Otherwise the > >> > > unroll > >> > > factor needs to be limited so that the first condition > >> > > holds. > >> > > That may limit the factor down to zero in the worst case. > >> > > */ > >> > > int dist = dist_v[0]; > >> > > if (dist < 0) > >> > > gcc_unreachable (); > >> > > else if ((unsigned)dist >= *unroll) > >> > > ; > >> > > else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS > >> > > (ddr) - 1) > >> > > || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS > >> > > (ddr) - 1) > >> > > && dist > 0)) > >> > > ; > >> > > else > >> > > *unroll = dist; > >> > > > >> > > where *unroll is similar to *max_vf I think. dist_v[0] is the > >> > > innermost loop. > >> > > > >> > > The vectorizer does way more complicated things and only looks at the > >> > > distance with respect to the outer loop as far as I can see which can > >> > > be negative. > >> > > > >> > > Not sure if fusion and vectorizer "interleaving" makes a difference > >> > > here. > >> > > I think the idea was that when interleaving stmt-by-stmt then forward > >> > > dependences would be preserved and thus we don't need to check the > >> > > inner loop dependences. speaking with "forward vs. backward" > >> > > dependences > >> > > again, not distances... > >> > > > >> > > This also means that unroll-and-jam could be enhanced to "interleave" > >> > > stmts and thus cover more cases? > >> > > > >> > > Again, I hope Micha can have a look here... > >> > > >> > Coming back to this... the vectorizer doesn't look at the inner loop > >> > distances > >> > at all at the moment. Bins orginal patch adds that for the case where > >> > the > >> > outer loop evolution results in a negative dependence distance. But what > >> > if the outer loop evolution results in a zero distance or a positive > >> > distance > >> > within the bounds of a sensible max_vf? So I think we need to "simply" > >> > look at all distance vector components, not just that of the outer loop > >> > and perform all regular checks. That also allows adjustment of max_vf > >> > for inner loop dependences in case vectorization with smaller vectors is > >> > still possible. > >> > > >> > Doing that by making loop_depth iterate > >> > fixes the testcase (as expected) but also regresses > >> > vect-outer-simd-[123].c > >> > if we make the code look at ->safelen of the inner loop when processing > >> > inner loop distances (because safelen is 0 there). Just using ->safelen > >> > from the outer loop doesn't regress anything it seems. > >> > > >> > So I am going to test the following attached. I have added the > >> > testcase also with reversed inner loop to verify we'd apply outer loop > >> > vectorization to that one. > >> > > >> > Any comments? > >> I kind of think that we are checking for different things in this way. > >> IIUC, dependence checks in outer loop vectorizer could be categorized > >> into three parts. The regular vect related check on outer loop; > >> dependence check to fuse inner loop iterations; and dependence check > >> to shuffle scalar statement/reference together. To pass the fusion > >> check, we only need to make sure no backward dependence is generated > >> in the result. So for case of zero/positive distance in outer loop > >> evolution, it won't result in backward dependence? 
For example, if we > >> adjust the original loop like: > >> > >> for (b = 4; b >= 0; b--) > >> for (c = 0; c <= 6; c++) > >> - a[c + 1][b + 2] = a[c][b + 1]; > >> + a[c + 1][b + 1] = a[c][b + 2]; > >> > >> The fusion result would be > >> for (b = 4; b > 0; b -= 4) > >> { > >> for (c = 0; c <= 6; c++) > >> { > >> a[c + 1][5] = a[c][6]; // S1 > >> a[c + 1][4] = a[c][5]; // S2 > >> a[c + 1][3] = a[c][4]; > >> a[c + 1][2] = a[c][3]; > >> } > >> } > >> Though there is dependence between s1/s2, it's forward dependence now. > > > > Hmm, OK. But this doesn't vectorize with or without the patch, we claim > > > > t.c:9:3: note: dependence distance = 1. > > t.c:11:29: missed: not vectorized, possible dependence between > > data-refs a[c.3_15][_48] and a[_3][_50] > > t.c:9:3: missed: bad data dependence. > > t.c:9:3: missed: couldn't vectorize loop > > > >> Even we need to check the negative distance for inner loop, using the > >> regular checks for fusion part for inner loop(s) is a bit strange. > >> Using outer loop's safelen here is also unclear. > >> > >> I am not sure where we do the shuffle related check, is it (dist == 0) > >> case? > >> Is that the reason that the patch employs the whole regular checks? > > > > I think the suffling needs to look at the outer loop distance and it is the > > == 0 and the forward dependence case where we check/constrain > > against the vectorization factor? > > > > I agree that formulating the outer loop dependence testing in a way to > > check the unroll-and-jam condition and the shuffling part separately > > would be best. I was mainly worried that your patch only hooks into > > the negative "outer" distance case and not the zero or positive > > distance one. For example for unroll-and-jam we constrain the > > maximum unroll factor. I wonder if we should simply call into > > its adjust_unroll_factor from the vectorizer or copy&paste it > > to the vectorizer. I'll note that with known dependence distances > > it seems to never compute failure... at least it computes the > > correct maximum vectorization factor for me. > > > > I also wonder about dependences for DRs in the outer loop > > which we'd even more heavily re-order. > > > > Anyways, we should somehow fix this. > > > > Attached is a patch variant cut&pasting the unroll-and-jam > > testing. > > > > Does this look better? > > > > Thanks, > > Richard. > > > >> Thanks, > >> bin > >> > > >> > Bootstrap & regtest running on x86_64-unknown-linux-gnu. > >> > > >> > Richard. > >> > > >> > 2019-03-21 Richard Biener <rguent...@suse.de> > >> > > >> > PR tree-optimization/81740 > >> > * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): > >> > Perform distance vector related dependence checks for all > >> > loops. > >> > > >> > * gcc.dg/vect/pr81740-1.c: New testcase. > >> > * gcc.dg/vect/pr81740-2.c: Likewise. > > > > 2019-03-25 Richard Biener <rguent...@suse.de> > > > > PR tree-optimization/81740 > > * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): > > Add explicit unroll-and-jam dependence testing. > > > > * gcc.dg/vect/pr81740-1.c: New testcase. > > * gcc.dg/vect/pr81740-2.c: Likewise. 
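For reference, here is a standalone paraphrase of the unroll-and-jam distance test that the patch below reuses.  This is my reading of the comment quoted earlier, not the GCC internals; lexico_nonneg and all_zero are stand-ins for lambda_vector_lexico_pos and lambda_vector_zerop.

#include <stdbool.h>

/* First nonzero component positive; an all-zero vector also counts.  */
static bool
lexico_nonneg (const int *v, int n)
{
  for (int i = 0; i < n; i++)
    {
      if (v[i] < 0)
        return false;
      if (v[i] > 0)
        return true;
    }
  return true;
}

static bool
all_zero (const int *v, int n)
{
  for (int i = 0; i < n; i++)
    if (v[i] != 0)
      return false;
  return true;
}

/* DIST_V is a dependence distance vector, outermost loop first, with NB
   components.  A distance (a, b) is at worst turned into (a/N, b) by
   unrolling the outer loop by N and jamming the copies, so it stays valid
   if a >= N, or b > 0, or b == 0 and a > 0; otherwise clamp *UNROLL to a
   (possibly down to 0).  */
static void
limit_unroll (const int *dist_v, int nb, unsigned *unroll)
{
  int dist = dist_v[0];
  if (dist < 0)
    return;                      /* distances are normalised to be >= 0 */
  if ((unsigned) dist >= *unroll)
    return;                      /* the unrolled copies never span the dependence */
  if (lexico_nonneg (dist_v + 1, nb - 1)
      || (all_zero (dist_v + 1, nb - 1) && dist > 0))
    return;                      /* forward at the inner level: jamming keeps the order */
  *unroll = dist;                /* backward at the inner level: unroll at most by a */
}

If I read the dependence in the original test case correctly, the write a[c + 1][b + 2] / read a[c][b + 1] pair has the normalised distance vector (1, -1), so the factor gets clamped to 1, i.e. outer-loop vectorization is effectively rejected, which matches the dist <= 1 failure path in the patch below.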
> > > > Index: gcc/tree-vect-data-refs.c > > =================================================================== > > --- gcc/tree-vect-data-refs.c (revision 269908) > > +++ gcc/tree-vect-data-refs.c (working copy) > > @@ -411,6 +411,45 @@ vect_analyze_data_ref_dependence (struct > > } > > > > loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr)); > > + gcc_assert (loop_depth == 0); > > + > > + /* Perform unroll-and-jam test. */ > > + if (nested_in_vect_loop_p (loop, stmtinfo_a)) > > + { > > + FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) > > + { > > + /* A distance (a,b) is at worst transformed into (a/N,b) by the > > + unrolling (factor N), so the transformation is valid if > > + a >= N, or b > 0, or b is zero and a > 0. Otherwise the unroll > > + factor needs to be limited so that the first condition holds. > > + That may limit the factor down to zero in the worst case. */ > > I think it would be good to say that we're applying a two-stage > check here (unroll-and-jam, then statement-level interleaving). As it > stands, it sounds like proving (a>0 && b==0) || b>0 is enough to prove > that outer-loop vectorisation is valid for any VF, whereas that isn't > true for the interleaving part. That said... > > > + int dist = dist_v[0]; > > + if (dist < 0) > > + gcc_unreachable (); > > + else if ((unsigned)dist >= *max_vf) > > + ; > > + else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - > > 1) > > + || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - > > 1) > > + && dist > 0)) > > + ; > > + else if (dist <= 1) > > + return opt_result::failure_at (stmtinfo_a->stmt, > > + "not vectorized, possible > > dependence" > > + " between data-refs %T and %T\n", > > + DR_REF (dra), DR_REF (drb)); > > + else > > + { > > + /* Dependence distance does not create dependence, as far as > > + vectorization is concerned, in this case. */ > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_NOTE, vect_location, > > + "dependence between data-refs %T and %T " > > + "constrains vectorization factor to %d\n", > > + DR_REF (dra), DR_REF (drb), dist); > > + *max_vf = dist; > > + } > > + } > > + } > > ...I guess this is looping back to the original discussion, sorry, but I'm > not sure taking it as a two-stage check really helps. We're going from: > > A: > for i in [0:n]: > for j in [0:n]: > A(i,j) > B(i,j) > > to: > > B: > for i in [0:n:2]: > for j in [0:n]: > A(i,j) > B(i,j) > for j in [0:n]: > A(i+1,j) > B(i+1,j) > > to: > > C: > for i in [0:n:2]: > for j in [0:n]: > A(i,j) > B(i,j) > A(i+1,j) > B(i+1,j) > > to: > > D: > for i in [0:n:2]: > for j in [0:n]: > A(i,j) > A(i+1,j) > B(i,j) > B(i+1,j) > > but since outer loop vectorisation doesn't change the order of > statements for each "i", IMO it's clearer to go directly from B to D > by treating the inner loop as though we were completely unrolling it. > C doesn't seem like a useful midway point. > > I think the only cases the patch is handling on top of existing checks are: > > (a) a<0 && b>0, normalised to a>0 && b<0 with a reverse flag. > (b) a==0 && b==0, which we now reject earlier > > I don't think (b) makes any difference in practice because > (i) we already reject accesses with a zero step and (ii) > without the support for runtime aliasing checks for outer loops, > we're not going to be able to add any extra disambiguation for > things like: > > d[i][j] = ...; > ... = d[i][j]; > > relative to other loop accesses, so there won't be any cases we can > handle here that wouldn't have been propagated earlier. 
> > So I think we're really left with (a) being the useful case, > which is also the one added by Bin's original patch. > > Based on the "complete unrolling" view, if we number statements as > (i, n), where i is the outer loop iteration and n is a statement number > in the (completely unrolled) loop body, then the original scalar code > executes in lexicographical order while for the vector loop: > > (1) (i,n) executes before (i+ix,n+nx) for all ix>=0, nx>=1, regardless of VF > (2) (i,n) executes before (i+ix,n-nx) for all ix>=VF, nx>=0 > (well, nx unrestricted, but only nx>=0 is useful given (1)) > > So for any kind of dependence between (i,n) and (i+ix,n-nx), ix>=1, nx>=0 > we need to restrict VF to ix so that (2) ensures the right order. > This means that the unnormalised distances of interest are: > > - (ix, -nx), ix>=1, nx>=0 > - (-ix, nx), ix>=1, nx>=0 > > But the second gets normalised to the first, which is actually useful > in this case :-). > > In terms of the existing code, I think that means we want to change > the handling of nested statements (only) to: > > - ignore DDR_REVERSED_P (ddr) > - restrict the main dist > 0 case to when the inner distance is <= 0. > > This should have the side effect of allowing outer-loop vectorisation for: > > void __attribute__ ((noipa)) > f (int a[][N], int b[restrict]) > { > for (int i = N - 1; i-- > 0; ) > for (int j = 0; j < N - 1; ++j) > a[j + 1][i] = a[j][i + 1] + b[i]; > } > > At the moment we reject this, but AFAICT it should be OK. > (We do allow it for s/i + 1/i/, since then the outer distance is 0.)
Can you file an enhancement request so we don't forget? > I think we can also skip the existing dist==0 code for nested statements > since any case that really matters (such as zero steps) should also have > a second dependence distance (1, 0) that we'd handle as above. Doing that > is probably something for stage 1 though. > > Hope at least some of that was vaguely right. ;-) OK, so there seems to be consensus that Bin's patch catches all cases, so let's go with it. I'm going to re-test & apply it. Richard. > > Thanks, > Richard
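As a footnote, a compact restatement (my paraphrase, not GCC code) of the ordering argument in Richard's message above, with statements numbered (i, n): i is the outer-loop iteration and n the position in the completely unrolled inner body.

#include <stdbool.h>

/* Sufficient conditions for (i, n) to execute before (i + ix, n + dn) in
   the outer-loop-vectorized code:
     (1) ix >= 0 and dn >= 1   -- statement order is preserved, any VF;
     (2) ix >= vf              -- a later vector iteration, any dn.  */
static bool
known_executes_before (int ix, int dn, int vf)
{
  return (ix >= 0 && dn >= 1) || ix >= vf;
}

/* Hence a dependence from (i, n) to (i + ix, n - nx) with ix >= 1 and
   nx >= 0 is only respected when ix >= vf, i.e. it caps the VF at ix;
   forward dependences (nx < 0) never constrain it.  */
static int
cap_vf (int ix, int vf)
{
  return ix < vf ? ix : vf;
}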