On Tue, Dec 19, 2017 at 12:58 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > On Mon, Dec 18, 2017 at 2:35 PM, Michael Matz <m...@suse.de> wrote: >> Hi, >> >> On Mon, 18 Dec 2017, Richard Biener wrote: >> >>> where *unroll is similar to *max_vf I think. dist_v[0] is the innermost >>> loop. >> >> [0] is always outermost loop. >> >>> The vectorizer does way more complicated things and only looks at the >>> distance with respect to the outer loop as far as I can see which can be >>> negative. >>> >>> Not sure if fusion and vectorizer "interleaving" makes a difference >>> here. I think the idea was that when interleaving stmt-by-stmt then >>> forward dependences would be preserved and thus we don't need to check >>> the inner loop dependences. speaking with "forward vs. backward" >>> dependences again, not distances... >>> >>> This also means that unroll-and-jam could be enhanced to "interleave" >>> stmts and thus cover more cases? >>> >>> Again, I hope Micha can have a look here... >> >> Haven't yet looked at the patch, but some comments anyway: >> >> fusion and interleaving interact in the following way in outer loop >> vectorization, conceptually: >> * (1) the outer loop is unrolled >> * (2) the inner loops are fused >> * (3) the (now single) inner body is rescheduled/shuffled/interleaved. >> > Thanks Michael for explaining issue clearer, this is what I meant. As > for PR60276, I think it's actually the other side of the problem, > which only relates to dependence validity of interleaving.
Interleaving validity is what is checked by the current code, I don't see any checking for validity of (2). Now, the current code only looks at the outer loop distances to verify interleaving validity. I think we need to verify whether fusion is valid, preferably clearly separated from the current code checking interleaving validity. I'm not 100% convinced the interleaving validity check is correct for the outer loop vectorization case. I think it helps to reduce the dependence checking code to what we do for unroll-and-jam: Index: gcc/tree-vect-data-refs.c =================================================================== --- gcc/tree-vect-data-refs.c (revision 255777) +++ gcc/tree-vect-data-refs.c (working copy) @@ -378,7 +378,26 @@ vect_analyze_data_ref_dependence (struct dump_printf_loc (MSG_NOTE, vect_location, "dependence distance = %d.\n", dist); - if (dist == 0) + if (dist < 0) + gcc_unreachable (); + + else if (dist >= *max_vf) + { + /* Dependence distance does not create dependence, as far as + vectorization is concerned, in this case. */ + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "dependence distance >= VF.\n"); + continue; + } + + else if (DDR_NB_LOOPS (ddr) == 2 + && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1) + || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1) + && dist > 0))) + continue; + + else if (dist == 0) { if (dump_enabled_p ()) { @@ -427,26 +446,7 @@ vect_analyze_data_ref_dependence (struct continue; } - if (dist > 0 && DDR_REVERSED_P (ddr)) - { - /* If DDR_REVERSED_P the order of the data-refs in DDR was - reversed (to make distance vector positive), and the actual - distance is negative. */ - if (dump_enabled_p ()) - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, - "dependence distance negative.\n"); - /* Record a negative dependence distance to later limit the - amount of stmt copying / unrolling we can perform. - Only need to handle read-after-write dependence. */ - if (DR_IS_READ (drb) - && (STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) == 0 - || STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) > (unsigned)dist)) - STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) = dist; - continue; - } - - if (abs (dist) >= 2 - && abs (dist) < *max_vf) + if (dist >= 2) { /* The dependence distance requires reduction of the maximal vectorization factor. */ @@ -455,15 +455,6 @@ vect_analyze_data_ref_dependence (struct dump_printf_loc (MSG_NOTE, vect_location, "adjusting maximal vectorization factor to %i\n", *max_vf); - } - - if (abs (dist) >= *max_vf) - { - /* Dependence distance does not create dependence, as far as - vectorization is concerned, in this case. */ - if (dump_enabled_p ()) - dump_printf_loc (MSG_NOTE, vect_location, - "dependence distance >= VF.\n"); continue; } that fixes the testcase (probably by removing the DDR_REVERSED special-casing) exposes in C vect.exp: FAIL: gcc.dg/vect/pr36630.c scan-tree-dump-times vect "vectorized 1 loops" 1 (found 0 times) FAIL: gcc.dg/vect/pr57558-2.c scan-tree-dump vect "vectorized 1 loops" FAIL: gcc.dg/vect/vect-outer-5.c execution test FAIL: gcc.dg/vect/vect-outer-5.c scan-tree-dump-times vect "OUTER LOOP VECTORIZED" 1 (found 2 times) For the interleaving correctness a negative distance is fine since we're only making it possibly more negative by executing stmts from the n + 1 iteration before stmts from the n'th iteration, right? But of course if the outer loop has negative distance that says nothing about the inner loop. So sth like else if (DDR_REVERSED_P (ddr) && (! nested_in_vect_loop_p (loop, DR_STMT (dra)) || ! nested_in_vect_loop_p (loop, DR_STMT (drb)) || dist_v[1] >= 0 || abs (dist_v[1]) >= *max_vf)) { /* If DDR_REVERSED_P the order of the data-refs in DDR was reversed (to make distance vector positive), and the actual distance is negative. */ if (dump_enabled_p ()) ... but then I wonder if PR60276 can also happen for inner loop refs. Also what about dependences between stmts in inner and stmts in the outer loop? With the above gcc.dg/vect/vect-outer-5.c still FAILs... /* Outer-loop 2: Not vectorizable because of dependence distance. */ for (i = 0; i < 4; i++) { s = 0; for (j=0; j<N; j+=4) s += C[j]; B[i+1] = B[i] + s; } that's because else if (DDR_NB_LOOPS (ddr) == 2 && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1) || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1) && dist > 0))) continue; hits. We simply don't have a perfect nest to begin with... Richard. > Thanks, > bin >> (1) is always okay. But (2) and (3) as individual transformations must >> both be checked for validity. If fusion isn't possible the whole >> transformation is invalid, and if interleaving isn't possible the same is >> true. In the specific example: >> >> for (b = 4; b >= 0; b--) >> for (c = 0; c <= 6; c++) >> t = a[c][b + 1]; // S1 >> a[c + 1][b + 2] = t; // S2 >> >> it's already the fusion step that's invalid. There's a >> dependence between S1 and S2, e.g. for (b,c) = (4,1) comes-before (3,0) >> with S1(4,1) reading a[1][5] and S2(3,0) writing a[1][5]. So a >> write-after-read. After fusing: >> >> for (c = 0; c <= 6; c++) >> { >> t = a[c][5]; // S1 >> a[c + 1][6] = t; >> t = a[c][4]; >> a[c + 1][5] = t; // S2 >> a[c + 1][4] = a[c][3]; >> a[c + 1][3] = a[c][2]; >> } >> >> here we have at iterations (c) = (0) comes-before (1), at S2(0) writing >> a[1][5] and S1(1) writing a[1][5]. I.e. now it's a read-after-write (the >> write in iteration 0 overwrites the value that is going to be read at >> iteration 1, which wasn't the case in the original loop). The dependence >> switched direction --> invalid. >> >> The simple interleaving of statements can't rectify this. >> Interleaving is an inner-body reordering but the brokenness comes from a >> cross-iteration ordering. >> >> This example can be unroll-jammed or outer-loop vectorized if one of the >> two loops is reversed. Let's say we reverse the inner loop, so that it >> runs in the same direction as the outer loop (reversal is possible here). >> >> It'd then be something like: >> >> for (c = 6; c >= 0; c--) >> { >> t = a[c][5]; // S1 >> a[c + 1][6] = t; >> t = a[c][4]; >> a[c + 1][5] = t; // S2 >> a[c + 1][4] = a[c][3]; >> a[c + 1][3] = a[c][2]; >> } >> >> The dependence between S1/S2 would still be a write-after-read, and all >> would be well. This reversal of the inner loop can partly be simulated by >> not only interleaving the inner insns, but by also _reodering_ them. But >> AFAIK the vectorizer doesn't do this? >> >> >> Ciao, >> Michael.