On Tue, Dec 19, 2017 at 12:56 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Tue, Dec 19, 2017 at 12:58 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Mon, Dec 18, 2017 at 2:35 PM, Michael Matz <m...@suse.de> wrote:
>>> Hi,
>>>
>>> On Mon, 18 Dec 2017, Richard Biener wrote:
>>>
>>>> where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost 
>>>> loop.
>>>
>>> [0] is always outermost loop.
>>>
>>>> The vectorizer does way more complicated things and only looks at the
>>>> distance with respect to the outer loop as far as I can see which can be
>>>> negative.
>>>>
>>>> Not sure if fusion and vectorizer "interleaving" makes a difference
>>>> here. I think the idea was that when interleaving stmt-by-stmt then
>>>> forward dependences would be preserved and thus we don't need to check
>>>> the inner loop dependences.  speaking with "forward vs. backward"
>>>> dependences again, not distances...
>>>>
>>>> This also means that unroll-and-jam could be enhanced to "interleave"
>>>> stmts and thus cover more cases?
>>>>
>>>> Again, I hope Micha can have a look here...
>>>
>>> Haven't yet looked at the patch, but some comments anyway:
>>>
>>> fusion and interleaving interact in the following way in outer loop
>>> vectorization, conceptually:
>>> * (1) the outer loop is unrolled
>>> * (2) the inner loops are fused
>>> * (3) the (now single) inner body is rescheduled/shuffled/interleaved.
>>>
>> Thanks Michael for explaining issue clearer, this is what I meant.  As
>> for PR60276, I think it's actually the other side of the problem,
>> which only relates to dependence validity of interleaving.
>
> Interleaving validity is what is checked by the current code, I don't
> see any checking for validity of (2).  Now, the current code only
> looks at the outer loop distances to verify interleaving validity.
>
> I think we need to verify whether fusion is valid, preferably clearly
> separated from the current code checking interleaving validity.
>
> I'm not 100% convinced the interleaving validity check is correct for
> the outer loop vectorization case.
>
> I think it helps to reduce the dependence checking code to what we do
> for unroll-and-jam:
>
> Index: gcc/tree-vect-data-refs.c
> ===================================================================
> --- gcc/tree-vect-data-refs.c   (revision 255777)
> +++ gcc/tree-vect-data-refs.c   (working copy)
> @@ -378,7 +378,26 @@ vect_analyze_data_ref_dependence (struct
>         dump_printf_loc (MSG_NOTE, vect_location,
>                           "dependence distance  = %d.\n", dist);
>
> -      if (dist == 0)
> +      if (dist < 0)
> +       gcc_unreachable ();
> +
> +      else if (dist >= *max_vf)
> +       {
> +         /* Dependence distance does not create dependence, as far as
> +            vectorization is concerned, in this case.  */
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_NOTE, vect_location,
> +                            "dependence distance >= VF.\n");
> +         continue;
> +       }
> +
> +      else if (DDR_NB_LOOPS (ddr) == 2
> +              && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 
> 1)
> +                  || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 
> 1)
> +                      && dist > 0)))
> +       continue;
> +
> +      else if (dist == 0)
>         {
>           if (dump_enabled_p ())
>             {
> @@ -427,26 +446,7 @@ vect_analyze_data_ref_dependence (struct
>           continue;
>         }
>
> -      if (dist > 0 && DDR_REVERSED_P (ddr))
> -       {
> -         /* If DDR_REVERSED_P the order of the data-refs in DDR was
> -            reversed (to make distance vector positive), and the actual
> -            distance is negative.  */
> -         if (dump_enabled_p ())
> -           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                            "dependence distance negative.\n");
> -         /* Record a negative dependence distance to later limit the
> -            amount of stmt copying / unrolling we can perform.
> -            Only need to handle read-after-write dependence.  */
> -         if (DR_IS_READ (drb)
> -             && (STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) == 0
> -                 || STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) > (unsigned)dist))
> -           STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) = dist;
> -         continue;
> -       }
Simply removing DDR_REVERSED_P doesn't work, as below failures indicating.
> -
> -      if (abs (dist) >= 2
> -         && abs (dist) < *max_vf)
> +      if (dist >= 2)
>         {
>           /* The dependence distance requires reduction of the maximal
>              vectorization factor.  */
> @@ -455,15 +455,6 @@ vect_analyze_data_ref_dependence (struct
>             dump_printf_loc (MSG_NOTE, vect_location,
>                              "adjusting maximal vectorization factor to %i\n",
>                              *max_vf);
> -       }
> -
> -      if (abs (dist) >= *max_vf)
> -       {
> -         /* Dependence distance does not create dependence, as far as
> -            vectorization is concerned, in this case.  */
> -         if (dump_enabled_p ())
> -           dump_printf_loc (MSG_NOTE, vect_location,
> -                            "dependence distance >= VF.\n");
>           continue;
>         }
>
>
> that fixes the testcase (probably by removing the DDR_REVERSED
> special-casing) exposes in C vect.exp:
>
> FAIL: gcc.dg/vect/pr36630.c scan-tree-dump-times vect "vectorized 1
> loops" 1 (found 0 times)
> FAIL: gcc.dg/vect/pr57558-2.c scan-tree-dump vect "vectorized 1 loops"
> FAIL: gcc.dg/vect/vect-outer-5.c execution test
> FAIL: gcc.dg/vect/vect-outer-5.c scan-tree-dump-times vect "OUTER LOOP
> VECTORIZED" 1 (found 2 times)
>
> For the interleaving correctness a negative distance is fine since
> we're only making it possibly
> more negative by executing stmts from the n + 1 iteration before stmts
> from the n'th iteration, right?
>
> But of course if the outer loop has negative distance that says
> nothing about the inner loop.
>
> So sth like
>
>       else if (DDR_REVERSED_P (ddr)
>                && (! nested_in_vect_loop_p (loop, DR_STMT (dra))
>                    || ! nested_in_vect_loop_p (loop, DR_STMT (drb))
>                    || dist_v[1] >= 0
>                    || abs (dist_v[1]) >= *max_vf))
This is incomplete/bogus?  It still misses DDR_REVERSED_P in innermost
loop vectorization?  Because both "! nested_in_vect_loop_p" and
"dist_v[1]" are only happens for outer loop vectorization?
>         {
>           /* If DDR_REVERSED_P the order of the data-refs in DDR was
>              reversed (to make distance vector positive), and the actual
>              distance is negative.  */
>           if (dump_enabled_p ())
> ...
>
> but then I wonder if PR60276 can also happen for inner loop refs.  Also
> what about dependences between stmts in inner and stmts in the outer
> loop?
IIRC, we can't really compute dependence for such cases.

Thanks,
bin
>
> With the above gcc.dg/vect/vect-outer-5.c still FAILs...
>
>   /* Outer-loop 2: Not vectorizable because of dependence distance. */
>   for (i = 0; i < 4; i++)
>     {
>       s = 0;
>       for (j=0; j<N; j+=4)
>         s += C[j];
>       B[i+1] = B[i] + s;
>     }
>
> that's because
>
>       else if (DDR_NB_LOOPS (ddr) == 2
>                && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 
> 1)
>                    || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 
> 1)
>                        && dist > 0)))
>         continue;
>
> hits.  We simply don't have a perfect nest to begin with...
>
> Richard.
>
>> Thanks,
>> bin
>>> (1) is always okay.  But (2) and (3) as individual transformations must
>>> both be checked for validity.  If fusion isn't possible the whole
>>> transformation is invalid, and if interleaving isn't possible the same is
>>> true.  In the specific example:
>>>
>>>   for (b = 4; b >= 0; b--)
>>>     for (c = 0; c <= 6; c++)
>>>       t = a[c][b + 1];      // S1
>>>       a[c + 1][b + 2] = t;  // S2
>>>
>>> it's already the fusion step that's invalid.  There's a
>>> dependence between S1 and S2, e.g. for (b,c) = (4,1) comes-before (3,0)
>>> with S1(4,1) reading a[1][5] and S2(3,0) writing a[1][5].  So a
>>> write-after-read.  After fusing:
>>>
>>>    for (c = 0; c <= 6; c++)
>>>      {
>>>        t = a[c][5];              // S1
>>>        a[c + 1][6] = t;
>>>        t = a[c][4];
>>>        a[c + 1][5] = t;          // S2
>>>        a[c + 1][4] = a[c][3];
>>>        a[c + 1][3] = a[c][2];
>>>      }
>>>
>>> here we have at iterations (c) = (0) comes-before (1), at S2(0) writing
>>> a[1][5] and S1(1) writing a[1][5].  I.e. now it's a read-after-write (the
>>> write in iteration 0 overwrites the value that is going to be read at
>>> iteration 1, which wasn't the case in the original loop).  The dependence
>>> switched direction --> invalid.
>>>
>>> The simple interleaving of statements can't rectify this.
>>> Interleaving is an inner-body reordering but the brokenness comes from a
>>> cross-iteration ordering.
>>>
>>> This example can be unroll-jammed or outer-loop vectorized if one of the
>>> two loops is reversed.  Let's say we reverse the inner loop, so that it
>>> runs in the same direction as the outer loop (reversal is possible here).
>>>
>>> It'd then be something like:
>>>
>>>    for (c = 6; c >= 0; c--)
>>>      {
>>>        t = a[c][5];              // S1
>>>        a[c + 1][6] = t;
>>>        t = a[c][4];
>>>        a[c + 1][5] = t;          // S2
>>>        a[c + 1][4] = a[c][3];
>>>        a[c + 1][3] = a[c][2];
>>>      }
>>>
>>> The dependence between S1/S2 would still be a write-after-read, and all
>>> would be well.  This reversal of the inner loop can partly be simulated by
>>> not only interleaving the inner insns, but by also _reodering_ them.  But
>>> AFAIK the vectorizer doesn't do this?
>>>
>>>
>>> Ciao,
>>> Michael.

Reply via email to