Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization

Richard Biener Tue, 19 Dec 2017 04:57:07 -0800

On Tue, Dec 19, 2017 at 12:58 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Mon, Dec 18, 2017 at 2:35 PM, Michael Matz <m...@suse.de> wrote:
>> Hi,
>>
>> On Mon, 18 Dec 2017, Richard Biener wrote:
>>
>>> where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost 
>>> loop.
>>
>> [0] is always outermost loop.
>>
>>> The vectorizer does way more complicated things and only looks at the
>>> distance with respect to the outer loop as far as I can see which can be
>>> negative.
>>>
>>> Not sure if fusion and vectorizer "interleaving" makes a difference
>>> here. I think the idea was that when interleaving stmt-by-stmt then
>>> forward dependences would be preserved and thus we don't need to check
>>> the inner loop dependences.  speaking with "forward vs. backward"
>>> dependences again, not distances...
>>>
>>> This also means that unroll-and-jam could be enhanced to "interleave"
>>> stmts and thus cover more cases?
>>>
>>> Again, I hope Micha can have a look here...
>>
>> Haven't yet looked at the patch, but some comments anyway:
>>
>> fusion and interleaving interact in the following way in outer loop
>> vectorization, conceptually:
>> * (1) the outer loop is unrolled
>> * (2) the inner loops are fused
>> * (3) the (now single) inner body is rescheduled/shuffled/interleaved.
>>
> Thanks Michael for explaining issue clearer, this is what I meant.  As
> for PR60276, I think it's actually the other side of the problem,
> which only relates to dependence validity of interleaving.


Interleaving validity is what is checked by the current code, I don't
see any checking for validity of (2).  Now, the current code only
looks at the outer loop distances to verify interleaving validity.

I think we need to verify whether fusion is valid, preferably clearly
separated from the current code checking interleaving validity.

I'm not 100% convinced the interleaving validity check is correct for
the outer loop vectorization case.

I think it helps to reduce the dependence checking code to what we do
for unroll-and-jam:

Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c   (revision 255777)
+++ gcc/tree-vect-data-refs.c   (working copy)
@@ -378,7 +378,26 @@ vect_analyze_data_ref_dependence (struct
        dump_printf_loc (MSG_NOTE, vect_location,
                          "dependence distance  = %d.\n", dist);

-      if (dist == 0)
+      if (dist < 0)
+       gcc_unreachable ();
+
+      else if (dist >= *max_vf)
+       {
+         /* Dependence distance does not create dependence, as far as
+            vectorization is concerned, in this case.  */
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_NOTE, vect_location,
+                            "dependence distance >= VF.\n");
+         continue;
+       }
+
+      else if (DDR_NB_LOOPS (ddr) == 2
+              && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
+                  || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
+                      && dist > 0)))
+       continue;
+
+      else if (dist == 0)
        {
          if (dump_enabled_p ())
            {
@@ -427,26 +446,7 @@ vect_analyze_data_ref_dependence (struct
          continue;
        }

-      if (dist > 0 && DDR_REVERSED_P (ddr))
-       {
-         /* If DDR_REVERSED_P the order of the data-refs in DDR was
-            reversed (to make distance vector positive), and the actual
-            distance is negative.  */
-         if (dump_enabled_p ())
-           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                            "dependence distance negative.\n");
-         /* Record a negative dependence distance to later limit the
-            amount of stmt copying / unrolling we can perform.
-            Only need to handle read-after-write dependence.  */
-         if (DR_IS_READ (drb)
-             && (STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) == 0
-                 || STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) > (unsigned)dist))
-           STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) = dist;
-         continue;
-       }
-
-      if (abs (dist) >= 2
-         && abs (dist) < *max_vf)
+      if (dist >= 2)
        {
          /* The dependence distance requires reduction of the maximal
             vectorization factor.  */
@@ -455,15 +455,6 @@ vect_analyze_data_ref_dependence (struct
            dump_printf_loc (MSG_NOTE, vect_location,
                             "adjusting maximal vectorization factor to %i\n",
                             *max_vf);
-       }
-
-      if (abs (dist) >= *max_vf)
-       {
-         /* Dependence distance does not create dependence, as far as
-            vectorization is concerned, in this case.  */
-         if (dump_enabled_p ())
-           dump_printf_loc (MSG_NOTE, vect_location,
-                            "dependence distance >= VF.\n");
          continue;
        }


that fixes the testcase (probably by removing the DDR_REVERSED
special-casing) exposes in C vect.exp:

FAIL: gcc.dg/vect/pr36630.c scan-tree-dump-times vect "vectorized 1
loops" 1 (found 0 times)
FAIL: gcc.dg/vect/pr57558-2.c scan-tree-dump vect "vectorized 1 loops"
FAIL: gcc.dg/vect/vect-outer-5.c execution test
FAIL: gcc.dg/vect/vect-outer-5.c scan-tree-dump-times vect "OUTER LOOP
VECTORIZED" 1 (found 2 times)

For the interleaving correctness a negative distance is fine since
we're only making it possibly
more negative by executing stmts from the n + 1 iteration before stmts
from the n'th iteration, right?

But of course if the outer loop has negative distance that says
nothing about the inner loop.

So sth like

      else if (DDR_REVERSED_P (ddr)
               && (! nested_in_vect_loop_p (loop, DR_STMT (dra))
                   || ! nested_in_vect_loop_p (loop, DR_STMT (drb))
                   || dist_v[1] >= 0
                   || abs (dist_v[1]) >= *max_vf))
        {
          /* If DDR_REVERSED_P the order of the data-refs in DDR was
             reversed (to make distance vector positive), and the actual
             distance is negative.  */
          if (dump_enabled_p ())
...

but then I wonder if PR60276 can also happen for inner loop refs.  Also
what about dependences between stmts in inner and stmts in the outer
loop?

With the above gcc.dg/vect/vect-outer-5.c still FAILs...

  /* Outer-loop 2: Not vectorizable because of dependence distance. */
  for (i = 0; i < 4; i++)
    {
      s = 0;
      for (j=0; j<N; j+=4)
        s += C[j];
      B[i+1] = B[i] + s;
    }

that's because

      else if (DDR_NB_LOOPS (ddr) == 2
               && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
                   || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
                       && dist > 0)))
        continue;

hits.  We simply don't have a perfect nest to begin with...

Richard.

> Thanks,
> bin
>> (1) is always okay.  But (2) and (3) as individual transformations must
>> both be checked for validity.  If fusion isn't possible the whole
>> transformation is invalid, and if interleaving isn't possible the same is
>> true.  In the specific example:
>>
>>   for (b = 4; b >= 0; b--)
>>     for (c = 0; c <= 6; c++)
>>       t = a[c][b + 1];      // S1
>>       a[c + 1][b + 2] = t;  // S2
>>
>> it's already the fusion step that's invalid.  There's a
>> dependence between S1 and S2, e.g. for (b,c) = (4,1) comes-before (3,0)
>> with S1(4,1) reading a[1][5] and S2(3,0) writing a[1][5].  So a
>> write-after-read.  After fusing:
>>
>>    for (c = 0; c <= 6; c++)
>>      {
>>        t = a[c][5];              // S1
>>        a[c + 1][6] = t;
>>        t = a[c][4];
>>        a[c + 1][5] = t;          // S2
>>        a[c + 1][4] = a[c][3];
>>        a[c + 1][3] = a[c][2];
>>      }
>>
>> here we have at iterations (c) = (0) comes-before (1), at S2(0) writing
>> a[1][5] and S1(1) writing a[1][5].  I.e. now it's a read-after-write (the
>> write in iteration 0 overwrites the value that is going to be read at
>> iteration 1, which wasn't the case in the original loop).  The dependence
>> switched direction --> invalid.
>>
>> The simple interleaving of statements can't rectify this.
>> Interleaving is an inner-body reordering but the brokenness comes from a
>> cross-iteration ordering.
>>
>> This example can be unroll-jammed or outer-loop vectorized if one of the
>> two loops is reversed.  Let's say we reverse the inner loop, so that it
>> runs in the same direction as the outer loop (reversal is possible here).
>>
>> It'd then be something like:
>>
>>    for (c = 6; c >= 0; c--)
>>      {
>>        t = a[c][5];              // S1
>>        a[c + 1][6] = t;
>>        t = a[c][4];
>>        a[c + 1][5] = t;          // S2
>>        a[c + 1][4] = a[c][3];
>>        a[c + 1][3] = a[c][2];
>>      }
>>
>> The dependence between S1/S2 would still be a write-after-read, and all
>> would be well.  This reversal of the inner loop can partly be simulated by
>> not only interleaving the inner insns, but by also _reodering_ them.  But
>> AFAIK the vectorizer doesn't do this?
>>
>>
>> Ciao,
>> Michael.

Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization

Reply via email to