On Fri, Mar 22, 2019 at 7:12 AM Bin.Cheng <amker.ch...@gmail.com> wrote: > > On Thu, Mar 21, 2019 at 8:24 PM Richard Biener > <richard.guent...@gmail.com> wrote: > > > > On Mon, Dec 18, 2017 at 1:37 PM Richard Biener > > <richard.guent...@gmail.com> wrote: > > > > > > On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > > > > On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener > > > > <richard.guent...@gmail.com> wrote: > > > >> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.ch...@gmail.com> > > > >> wrote: > > > >>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.ch...@gmail.com> > > > >>> wrote: > > > >>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener > > > >>>> <richard.guent...@gmail.com> wrote: > > > >>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <bin.ch...@arm.com> > > > >>>>> wrote: > > > >>>>>> Hi, > > > >>>>>> As explained in the PR, given below test case: > > > >>>>>> int a[8][10] = { [2][5] = 4 }, c; > > > >>>>>> > > > >>>>>> int > > > >>>>>> main () > > > >>>>>> { > > > >>>>>> short b; > > > >>>>>> int i, d; > > > >>>>>> for (b = 4; b >= 0; b--) > > > >>>>>> for (c = 0; c <= 6; c++) > > > >>>>>> a[c + 1][b + 2] = a[c][b + 1]; > > > >>>>>> for (i = 0; i < 8; i++) > > > >>>>>> for (d = 0; d < 10; d++) > > > >>>>>> if (a[i][d] != (i == 3 && d == 6) * 4) > > > >>>>>> __builtin_abort (); > > > >>>>>> return 0; > > > >>>>>> > > > >>>>>> the loop nest is illegal for vectorization without reversing inner > > > >>>>>> loop. The issue > > > >>>>>> is in data dependence checking of vectorizer, I believe the > > > >>>>>> mentioned revision just > > > >>>>>> exposed this. Previously the vectorization is skipped because of > > > >>>>>> unsupported memory > > > >>>>>> operation. The outer loop vectorization unrolls the outer loop > > > >>>>>> into: > > > >>>>>> > > > >>>>>> for (b = 4; b > 0; b -= 4) > > > >>>>>> { > > > >>>>>> for (c = 0; c <= 6; c++) > > > >>>>>> a[c + 1][6] = a[c][5]; > > > >>>>>> for (c = 0; c <= 6; c++) > > > >>>>>> a[c + 1][5] = a[c][4]; > > > >>>>>> for (c = 0; c <= 6; c++) > > > >>>>>> a[c + 1][4] = a[c][3]; > > > >>>>>> for (c = 0; c <= 6; c++) > > > >>>>>> a[c + 1][3] = a[c][2]; > > > >>>>>> } > > > >>>>>> Then four inner loops are fused into: > > > >>>>>> for (b = 4; b > 0; b -= 4) > > > >>>>>> { > > > >>>>>> for (c = 0; c <= 6; c++) > > > >>>>>> { > > > >>>>>> a[c + 1][6] = a[c][5]; // S1 > > > >>>>>> a[c + 1][5] = a[c][4]; // S2 > > > >>>>>> a[c + 1][4] = a[c][3]; > > > >>>>>> a[c + 1][3] = a[c][2]; > > > >>>>>> } > > > >>>>>> } > > > >>>>> > > > >>>>> Note that they are not really "fused" but they are interleaved. > > > >>>>> With > > > >>>>> GIMPLE in mind > > > >>>>> that makes a difference, you should get the equivalent of > > > >>>>> > > > >>>>> for (c = 0; c <= 6; c++) > > > >>>>> { > > > >>>>> tem1 = a[c][5]; > > > >>>>> tem2 = a[c][4]; > > > >>>>> tem3 = a[c][3]; > > > >>>>> tem4 = a[c][2]; > > > >>>>> a[c+1][6] = tem1; > > > >>>>> a[c +1][5] = tem2; > > > >>>>> a[c+1][4] = tem3; > > > >>>>> a[c+1][3] = tem4; > > > >>>>> } > > > >>>> Yeah, I will double check if this abstract breaks the patch and how. > > > >>> Hmm, I think this doesn't break it, well at least for part of the > > > >>> analysis, because it is loop carried (backward) dependence goes wrong, > > > >>> interleaving or not with the same iteration doesn't matter here. > > > >> > > > >> I think the idea is that forward dependences are always fine (negative > > > >> distance) > > > >> to vectorize. But with backward dependences we have to adhere to > > > >> max_vf. > > > >> > > > >> It looks like for outer loop vectorization we only look at the > > > >> distances in the > > > >> outer loop but never at inner ones. But here the same applies but > > > >> isn't that > > > >> independend on the distances with respect to the outer loop? > > > >> > > > >> But maybe I'm misunderstanding how "distances" work here. > > > > Hmm, I am not sure I understand "distance" correctly. With > > > > description as in book like "Optimizing compilers for Modern > > > > Architectures", distance is "# of iteration of sink ref - # of > > > > iteration of source ref". Given below example: > > > > for (i = 0; i < N; ++i) > > > > { > > > > x = arr[idx_1]; // S1 > > > > arr[idx_2] = x; // S2 > > > > } > > > > if S1 is source ref, distance = idx_2 - idx_1, and distance > 0. Also > > > > this is forward dependence. For example, idx_1 is i + 1 and idx_2 is > > > > i; > > > > If S2 is source ref, distance = idx_1 - idx_2, and distance < 0. Also > > > > this is backward dependence. For example idx_1 is i and idx_2 is i + > > > > 1; > > > > > > > > In GCC, we always try to subtract idx_2 from idx_1 first in computing > > > > classic distance, we could result in negative distance in case of > > > > backward dependence. When this happens at dependence carried level, > > > > we manually reverse it. When this happens at inner level loop, we > > > > simply keep the negative distance. And it's meaningless to talk about > > > > forward/backward given dependence is carried at outer level loop. > > > > > > > > Outer loop vectorization is interesting. The problematic case has > > > > backward dependence carried by outer loop. Because we don't check > > > > dist vs. max_vf for such dep, the unrolled references could have outer > > > > loop index equal to each other, as in a[c][5] and a[c+1][5]. So it's > > > > like we have outer loop index resolved as equal. Now it has > > > > dependence only if c == c' + 1. I think previous comment on fusion > > > > still hold for this and we now need to check backward dependence > > > > between the two reference at inner level loop. I guess it's a bit > > > > trick to model dependence between a[c][5] and a[c+1][5] using the > > > > original references and dependence. I think it's okay to do that > > > > because distance of a[c][5] and a[c+1][5] is what we computed > > > > previously for the original references at inner level loop. > > > > > > > > Not sure if I have missed something important. > > > > > > Not sure either. unroll-and-jam does > > > > > > FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v) > > > { > > > /* A distance (a,b) is at worst transformed into (a/N,b) by the > > > unrolling (factor N), so the transformation is valid if > > > a >= N, or b > 0, or b is zero and a > 0. Otherwise the > > > unroll > > > factor needs to be limited so that the first condition holds. > > > That may limit the factor down to zero in the worst case. */ > > > int dist = dist_v[0]; > > > if (dist < 0) > > > gcc_unreachable (); > > > else if ((unsigned)dist >= *unroll) > > > ; > > > else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS > > > (ddr) - 1) > > > || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS > > > (ddr) - 1) > > > && dist > 0)) > > > ; > > > else > > > *unroll = dist; > > > > > > where *unroll is similar to *max_vf I think. dist_v[0] is the innermost > > > loop. > > > > > > The vectorizer does way more complicated things and only looks at the > > > distance with respect to the outer loop as far as I can see which can > > > be negative. > > > > > > Not sure if fusion and vectorizer "interleaving" makes a difference here. > > > I think the idea was that when interleaving stmt-by-stmt then forward > > > dependences would be preserved and thus we don't need to check the > > > inner loop dependences. speaking with "forward vs. backward" dependences > > > again, not distances... > > > > > > This also means that unroll-and-jam could be enhanced to "interleave" > > > stmts and thus cover more cases? > > > > > > Again, I hope Micha can have a look here... > > > > Coming back to this... the vectorizer doesn't look at the inner loop > > distances > > at all at the moment. Bins orginal patch adds that for the case where the > > outer loop evolution results in a negative dependence distance. But what > > if the outer loop evolution results in a zero distance or a positive > > distance > > within the bounds of a sensible max_vf? So I think we need to "simply" > > look at all distance vector components, not just that of the outer loop > > and perform all regular checks. That also allows adjustment of max_vf > > for inner loop dependences in case vectorization with smaller vectors is > > still possible. > > > > Doing that by making loop_depth iterate > > fixes the testcase (as expected) but also regresses vect-outer-simd-[123].c > > if we make the code look at ->safelen of the inner loop when processing > > inner loop distances (because safelen is 0 there). Just using ->safelen > > from the outer loop doesn't regress anything it seems. > > > > So I am going to test the following attached. I have added the > > testcase also with reversed inner loop to verify we'd apply outer loop > > vectorization to that one. > > > > Any comments? > I kind of think that we are checking for different things in this way. > IIUC, dependence checks in outer loop vectorizer could be categorized > into three parts. The regular vect related check on outer loop; > dependence check to fuse inner loop iterations; and dependence check > to shuffle scalar statement/reference together. To pass the fusion > check, we only need to make sure no backward dependence is generated > in the result. So for case of zero/positive distance in outer loop > evolution, it won't result in backward dependence? For example, if we > adjust the original loop like: > > for (b = 4; b >= 0; b--) > for (c = 0; c <= 6; c++) > - a[c + 1][b + 2] = a[c][b + 1]; > + a[c + 1][b + 1] = a[c][b + 2]; > > The fusion result would be > for (b = 4; b > 0; b -= 4) > { > for (c = 0; c <= 6; c++) > { > a[c + 1][5] = a[c][6]; // S1 > a[c + 1][4] = a[c][5]; // S2 > a[c + 1][3] = a[c][4]; > a[c + 1][2] = a[c][3]; > } > } > Though there is dependence between s1/s2, it's forward dependence now.
Hmm, OK. But this doesn't vectorize with or without the patch, we claim t.c:9:3: note: dependence distance = 1. t.c:11:29: missed: not vectorized, possible dependence between data-refs a[c.3_15][_48] and a[_3][_50] t.c:9:3: missed: bad data dependence. t.c:9:3: missed: couldn't vectorize loop > Even we need to check the negative distance for inner loop, using the > regular checks for fusion part for inner loop(s) is a bit strange. > Using outer loop's safelen here is also unclear. > > I am not sure where we do the shuffle related check, is it (dist == 0) case? > Is that the reason that the patch employs the whole regular checks? I think the suffling needs to look at the outer loop distance and it is the == 0 and the forward dependence case where we check/constrain against the vectorization factor? I agree that formulating the outer loop dependence testing in a way to check the unroll-and-jam condition and the shuffling part separately would be best. I was mainly worried that your patch only hooks into the negative "outer" distance case and not the zero or positive distance one. For example for unroll-and-jam we constrain the maximum unroll factor. I wonder if we should simply call into its adjust_unroll_factor from the vectorizer or copy&paste it to the vectorizer. I'll note that with known dependence distances it seems to never compute failure... at least it computes the correct maximum vectorization factor for me. I also wonder about dependences for DRs in the outer loop which we'd even more heavily re-order. Anyways, we should somehow fix this. Attached is a patch variant cut&pasting the unroll-and-jam testing. Does this look better? Thanks, Richard. > Thanks, > bin > > > > Bootstrap & regtest running on x86_64-unknown-linux-gnu. > > > > Richard. > > > > 2019-03-21 Richard Biener <rguent...@suse.de> > > > > PR tree-optimization/81740 > > * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): > > Perform distance vector related dependence checks for all > > loops. > > > > * gcc.dg/vect/pr81740-1.c: New testcase. > > * gcc.dg/vect/pr81740-2.c: Likewise.
fix-pr81740-2
Description: Binary data