On Tue, Dec 19, 2017 at 12:56 PM, Richard Biener <richard.guent...@gmail.com> wrote: > On Tue, Dec 19, 2017 at 12:58 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >> On Mon, Dec 18, 2017 at 2:35 PM, Michael Matz <m...@suse.de> wrote: >>> Hi, >>> >>> On Mon, 18 Dec 2017, Richard Biener wrote: >>> >>>> where *unroll is similar to *max_vf I think. dist_v[0] is the innermost >>>> loop. >>> >>> [0] is always outermost loop. >>> >>>> The vectorizer does way more complicated things and only looks at the >>>> distance with respect to the outer loop as far as I can see which can be >>>> negative. >>>> >>>> Not sure if fusion and vectorizer "interleaving" makes a difference >>>> here. I think the idea was that when interleaving stmt-by-stmt then >>>> forward dependences would be preserved and thus we don't need to check >>>> the inner loop dependences. speaking with "forward vs. backward" >>>> dependences again, not distances... >>>> >>>> This also means that unroll-and-jam could be enhanced to "interleave" >>>> stmts and thus cover more cases? >>>> >>>> Again, I hope Micha can have a look here... >>> >>> Haven't yet looked at the patch, but some comments anyway: >>> >>> fusion and interleaving interact in the following way in outer loop >>> vectorization, conceptually: >>> * (1) the outer loop is unrolled >>> * (2) the inner loops are fused >>> * (3) the (now single) inner body is rescheduled/shuffled/interleaved. >>> >> Thanks Michael for explaining issue clearer, this is what I meant. As >> for PR60276, I think it's actually the other side of the problem, >> which only relates to dependence validity of interleaving. > > Interleaving validity is what is checked by the current code, I don't > see any checking for validity of (2). Now, the current code only > looks at the outer loop distances to verify interleaving validity. > > I think we need to verify whether fusion is valid, preferably clearly > separated from the current code checking interleaving validity. > > I'm not 100% convinced the interleaving validity check is correct for > the outer loop vectorization case. > > I think it helps to reduce the dependence checking code to what we do > for unroll-and-jam: > > Index: gcc/tree-vect-data-refs.c > =================================================================== > --- gcc/tree-vect-data-refs.c (revision 255777) > +++ gcc/tree-vect-data-refs.c (working copy) > @@ -378,7 +378,26 @@ vect_analyze_data_ref_dependence (struct > dump_printf_loc (MSG_NOTE, vect_location, > "dependence distance = %d.\n", dist); > > - if (dist == 0) > + if (dist < 0) > + gcc_unreachable (); > + > + else if (dist >= *max_vf) > + { > + /* Dependence distance does not create dependence, as far as > + vectorization is concerned, in this case. */ > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "dependence distance >= VF.\n"); > + continue; > + } > + > + else if (DDR_NB_LOOPS (ddr) == 2 > + && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - > 1) > + || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - > 1) > + && dist > 0))) > + continue; > + > + else if (dist == 0) > { > if (dump_enabled_p ()) > { > @@ -427,26 +446,7 @@ vect_analyze_data_ref_dependence (struct > continue; > } > > - if (dist > 0 && DDR_REVERSED_P (ddr)) > - { > - /* If DDR_REVERSED_P the order of the data-refs in DDR was > - reversed (to make distance vector positive), and the actual > - distance is negative. */ > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > - "dependence distance negative.\n"); > - /* Record a negative dependence distance to later limit the > - amount of stmt copying / unrolling we can perform. > - Only need to handle read-after-write dependence. */ > - if (DR_IS_READ (drb) > - && (STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) == 0 > - || STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) > (unsigned)dist)) > - STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) = dist; > - continue; > - } Simply removing DDR_REVERSED_P doesn't work, as below failures indicating. > - > - if (abs (dist) >= 2 > - && abs (dist) < *max_vf) > + if (dist >= 2) > { > /* The dependence distance requires reduction of the maximal > vectorization factor. */ > @@ -455,15 +455,6 @@ vect_analyze_data_ref_dependence (struct > dump_printf_loc (MSG_NOTE, vect_location, > "adjusting maximal vectorization factor to %i\n", > *max_vf); > - } > - > - if (abs (dist) >= *max_vf) > - { > - /* Dependence distance does not create dependence, as far as > - vectorization is concerned, in this case. */ > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_NOTE, vect_location, > - "dependence distance >= VF.\n"); > continue; > } > > > that fixes the testcase (probably by removing the DDR_REVERSED > special-casing) exposes in C vect.exp: > > FAIL: gcc.dg/vect/pr36630.c scan-tree-dump-times vect "vectorized 1 > loops" 1 (found 0 times) > FAIL: gcc.dg/vect/pr57558-2.c scan-tree-dump vect "vectorized 1 loops" > FAIL: gcc.dg/vect/vect-outer-5.c execution test > FAIL: gcc.dg/vect/vect-outer-5.c scan-tree-dump-times vect "OUTER LOOP > VECTORIZED" 1 (found 2 times) > > For the interleaving correctness a negative distance is fine since > we're only making it possibly > more negative by executing stmts from the n + 1 iteration before stmts > from the n'th iteration, right? > > But of course if the outer loop has negative distance that says > nothing about the inner loop. > > So sth like > > else if (DDR_REVERSED_P (ddr) > && (! nested_in_vect_loop_p (loop, DR_STMT (dra)) > || ! nested_in_vect_loop_p (loop, DR_STMT (drb)) > || dist_v[1] >= 0 > || abs (dist_v[1]) >= *max_vf)) This is incomplete/bogus? It still misses DDR_REVERSED_P in innermost loop vectorization? Because both "! nested_in_vect_loop_p" and "dist_v[1]" are only happens for outer loop vectorization? > { > /* If DDR_REVERSED_P the order of the data-refs in DDR was > reversed (to make distance vector positive), and the actual > distance is negative. */ > if (dump_enabled_p ()) > ... > > but then I wonder if PR60276 can also happen for inner loop refs. Also > what about dependences between stmts in inner and stmts in the outer > loop? IIRC, we can't really compute dependence for such cases.
Thanks, bin > > With the above gcc.dg/vect/vect-outer-5.c still FAILs... > > /* Outer-loop 2: Not vectorizable because of dependence distance. */ > for (i = 0; i < 4; i++) > { > s = 0; > for (j=0; j<N; j+=4) > s += C[j]; > B[i+1] = B[i] + s; > } > > that's because > > else if (DDR_NB_LOOPS (ddr) == 2 > && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - > 1) > || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - > 1) > && dist > 0))) > continue; > > hits. We simply don't have a perfect nest to begin with... > > Richard. > >> Thanks, >> bin >>> (1) is always okay. But (2) and (3) as individual transformations must >>> both be checked for validity. If fusion isn't possible the whole >>> transformation is invalid, and if interleaving isn't possible the same is >>> true. In the specific example: >>> >>> for (b = 4; b >= 0; b--) >>> for (c = 0; c <= 6; c++) >>> t = a[c][b + 1]; // S1 >>> a[c + 1][b + 2] = t; // S2 >>> >>> it's already the fusion step that's invalid. There's a >>> dependence between S1 and S2, e.g. for (b,c) = (4,1) comes-before (3,0) >>> with S1(4,1) reading a[1][5] and S2(3,0) writing a[1][5]. So a >>> write-after-read. After fusing: >>> >>> for (c = 0; c <= 6; c++) >>> { >>> t = a[c][5]; // S1 >>> a[c + 1][6] = t; >>> t = a[c][4]; >>> a[c + 1][5] = t; // S2 >>> a[c + 1][4] = a[c][3]; >>> a[c + 1][3] = a[c][2]; >>> } >>> >>> here we have at iterations (c) = (0) comes-before (1), at S2(0) writing >>> a[1][5] and S1(1) writing a[1][5]. I.e. now it's a read-after-write (the >>> write in iteration 0 overwrites the value that is going to be read at >>> iteration 1, which wasn't the case in the original loop). The dependence >>> switched direction --> invalid. >>> >>> The simple interleaving of statements can't rectify this. >>> Interleaving is an inner-body reordering but the brokenness comes from a >>> cross-iteration ordering. >>> >>> This example can be unroll-jammed or outer-loop vectorized if one of the >>> two loops is reversed. Let's say we reverse the inner loop, so that it >>> runs in the same direction as the outer loop (reversal is possible here). >>> >>> It'd then be something like: >>> >>> for (c = 6; c >= 0; c--) >>> { >>> t = a[c][5]; // S1 >>> a[c + 1][6] = t; >>> t = a[c][4]; >>> a[c + 1][5] = t; // S2 >>> a[c + 1][4] = a[c][3]; >>> a[c + 1][3] = a[c][2]; >>> } >>> >>> The dependence between S1/S2 would still be a write-after-read, and all >>> would be well. This reversal of the inner loop can partly be simulated by >>> not only interleaving the inner insns, but by also _reodering_ them. But >>> AFAIK the vectorizer doesn't do this? >>> >>> >>> Ciao, >>> Michael.