On Thu, Mar 21, 2019 at 8:24 PM Richard Biener <richard.guent...@gmail.com> wrote: > > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener > <richard.guent...@gmail.com> wrote: > > > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener > > > <richard.guent...@gmail.com> wrote: > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.ch...@gmail.com> > > >>> wrote: > > >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener > > >>>> <richard.guent...@gmail.com> wrote: > > >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <bin.ch...@arm.com> wrote: > > >>>>>> Hi, > > >>>>>> As explained in the PR, given below test case: > > >>>>>> int a[8][10] = { [2][5] = 4 }, c; > > >>>>>> > > >>>>>> int > > >>>>>> main () > > >>>>>> { > > >>>>>> short b; > > >>>>>> int i, d; > > >>>>>> for (b = 4; b >= 0; b--) > > >>>>>> for (c = 0; c <= 6; c++) > > >>>>>> a[c + 1][b + 2] = a[c][b + 1]; > > >>>>>> for (i = 0; i < 8; i++) > > >>>>>> for (d = 0; d < 10; d++) > > >>>>>> if (a[i][d] != (i == 3 && d == 6) * 4) > > >>>>>> __builtin_abort (); > > >>>>>> return 0; > > >>>>>> > > >>>>>> the loop nest is illegal for vectorization without reversing inner > > >>>>>> loop. The issue > > >>>>>> is in data dependence checking of vectorizer, I believe the > > >>>>>> mentioned revision just > > >>>>>> exposed this. Previously the vectorization is skipped because of > > >>>>>> unsupported memory > > >>>>>> operation. The outer loop vectorization unrolls the outer loop into: > > >>>>>> > > >>>>>> for (b = 4; b > 0; b -= 4) > > >>>>>> { > > >>>>>> for (c = 0; c <= 6; c++) > > >>>>>> a[c + 1][6] = a[c][5]; > > >>>>>> for (c = 0; c <= 6; c++) > > >>>>>> a[c + 1][5] = a[c][4]; > > >>>>>> for (c = 0; c <= 6; c++) > > >>>>>> a[c + 1][4] = a[c][3]; > > >>>>>> for (c = 0; c <= 6; c++) > > >>>>>> a[c + 1][3] = a[c][2]; > > >>>>>> } > > >>>>>> Then four inner loops are fused into: > > >>>>>> for (b = 4; b > 0; b -= 4) > > >>>>>> { > > >>>>>> for (c = 0; c <= 6; c++) > > >>>>>> { > > >>>>>> a[c + 1][6] = a[c][5]; // S1 > > >>>>>> a[c + 1][5] = a[c][4]; // S2 > > >>>>>> a[c + 1][4] = a[c][3]; > > >>>>>> a[c + 1][3] = a[c][2]; > > >>>>>> } > > >>>>>> } > > >>>>> > > >>>>> Note that they are not really "fused" but they are interleaved. With > > >>>>> GIMPLE in mind > > >>>>> that makes a difference, you should get the equivalent of > > >>>>> > > >>>>> for (c = 0; c <= 6; c++) > > >>>>> { > > >>>>> tem1 = a[c][5]; > > >>>>> tem2 = a[c][4]; > > >>>>> tem3 = a[c][3]; > > >>>>> tem4 = a[c][2]; > > >>>>> a[c+1][6] = tem1; > > >>>>> a[c +1][5] = tem2; > > >>>>> a[c+1][4] = tem3; > > >>>>> a[c+1][3] = tem4; > > >>>>> } > > >>>> Yeah, I will double check if this abstract breaks the patch and how. > > >>> Hmm, I think this doesn't break it, well at least for part of the > > >>> analysis, because it is loop carried (backward) dependence goes wrong, > > >>> interleaving or not with the same iteration doesn't matter here. > > >> > > >> I think the idea is that forward dependences are always fine (negative > > >> distance) > > >> to vectorize. But with backward dependences we have to adhere to max_vf. > > >> > > >> It looks like for outer loop vectorization we only look at the distances > > >> in the > > >> outer loop but never at inner ones. But here the same applies but isn't > > >> that > > >> independend on the distances with respect to the outer loop? > > >> > > >> But maybe I'm misunderstanding how "distances" work here. > > > Hmm, I am not sure I understand "distance" correctly. With > > > description as in book like "Optimizing compilers for Modern > > > Architectures", distance is "# of iteration of sink ref - # of > > > iteration of source ref". Given below example: > > > for (i = 0; i < N; ++i) > > > { > > > x = arr[idx_1]; // S1 > > > arr[idx_2] = x; // S2 > > > } > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. Also > > > this is forward dependence. For example, idx_1 is i + 1 and idx_2 is > > > i; > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. Also > > > this is backward dependence. For example idx_1 is i and idx_2 is i + > > > 1; > > > > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing > > > classic distance, we could result in negative distance in case of > > > backward dependence. When this happens at dependence carried level, > > > we manually reverse it. When this happens at inner level loop, we > > > simply keep the negative distance. And it's meaningless to talk about > > > forward/backward given dependence is carried at outer level loop. > > > > > > Outer loop vectorization is interesting. The problematic case has > > > backward dependence carried by outer loop. Because we don't check > > > dist vs. max_vf for such dep, the unrolled references could have outer > > > loop index equal to each other, as in a[c][5] and a[c+1][5]. So it's > > > like we have outer loop index resolved as equal. Now it has > > > dependence only if c == c' + 1. I think previous comment on fusion > > > still hold for this and we now need to check backward dependence > > > between the two reference at inner level loop. I guess it's a bit > > > trick to model dependence between a[c][5] and a[c+1][5] using the > > > original references and dependence. I think it's okay to do that > > > because distance of a[c][5] and a[c+1][5] is what we computed > > > previously for the original references at inner level loop. > > > > > > Not sure if I have missed something important. > > > > Not sure either. unroll-and-jam does > > > > FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) > > { > > /* A distance (a,b) is at worst transformed into (a/N,b) by the > > unrolling (factor N), so the transformation is valid if > > a >= N, or b > 0, or b is zero and a > 0. Otherwise the unroll > > factor needs to be limited so that the first condition holds. > > That may limit the factor down to zero in the worst case. */ > > int dist = dist_v[0]; > > if (dist < 0) > > gcc_unreachable (); > > else if ((unsigned)dist >= *unroll) > > ; > > else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) > > - 1) > > || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) > > - 1) > > && dist > 0)) > > ; > > else > > *unroll = dist; > > > > where *unroll is similar to *max_vf I think. dist_v[0] is the innermost > > loop. > > > > The vectorizer does way more complicated things and only looks at the > > distance with respect to the outer loop as far as I can see which can > > be negative. > > > > Not sure if fusion and vectorizer "interleaving" makes a difference here. > > I think the idea was that when interleaving stmt-by-stmt then forward > > dependences would be preserved and thus we don't need to check the > > inner loop dependences. speaking with "forward vs. backward" dependences > > again, not distances... > > > > This also means that unroll-and-jam could be enhanced to "interleave" > > stmts and thus cover more cases? > > > > Again, I hope Micha can have a look here... > > Coming back to this... the vectorizer doesn't look at the inner loop > distances > at all at the moment. Bins orginal patch adds that for the case where the > outer loop evolution results in a negative dependence distance. But what > if the outer loop evolution results in a zero distance or a positive distance > within the bounds of a sensible max_vf? So I think we need to "simply" > look at all distance vector components, not just that of the outer loop > and perform all regular checks. That also allows adjustment of max_vf > for inner loop dependences in case vectorization with smaller vectors is > still possible. > > Doing that by making loop_depth iterate > fixes the testcase (as expected) but also regresses vect-outer-simd-[123].c > if we make the code look at ->safelen of the inner loop when processing > inner loop distances (because safelen is 0 there). Just using ->safelen > from the outer loop doesn't regress anything it seems. > > So I am going to test the following attached. I have added the > testcase also with reversed inner loop to verify we'd apply outer loop > vectorization to that one. > > Any comments? I kind of think that we are checking for different things in this way. IIUC, dependence checks in outer loop vectorizer could be categorized into three parts. The regular vect related check on outer loop; dependence check to fuse inner loop iterations; and dependence check to shuffle scalar statement/reference together. To pass the fusion check, we only need to make sure no backward dependence is generated in the result. So for case of zero/positive distance in outer loop evolution, it won't result in backward dependence? For example, if we adjust the original loop like:
for (b = 4; b >= 0; b--) for (c = 0; c <= 6; c++) - a[c + 1][b + 2] = a[c][b + 1]; + a[c + 1][b + 1] = a[c][b + 2]; The fusion result would be for (b = 4; b > 0; b -= 4) { for (c = 0; c <= 6; c++) { a[c + 1][5] = a[c][6]; // S1 a[c + 1][4] = a[c][5]; // S2 a[c + 1][3] = a[c][4]; a[c + 1][2] = a[c][3]; } } Though there is dependence between s1/s2, it's forward dependence now. Even we need to check the negative distance for inner loop, using the regular checks for fusion part for inner loop(s) is a bit strange. Using outer loop's safelen here is also unclear. I am not sure where we do the shuffle related check, is it (dist == 0) case? Is that the reason that the patch employs the whole regular checks? Thanks, bin > > Bootstrap & regtest running on x86_64-unknown-linux-gnu. > > Richard. > > 2019-03-21 Richard Biener <rguent...@suse.de> > > PR tree-optimization/81740 > * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): > Perform distance vector related dependence checks for all > loops. > > * gcc.dg/vect/pr81740-1.c: New testcase. > * gcc.dg/vect/pr81740-2.c: Likewise.