On Fri, 12 Apr 2024, Tamar Christina wrote:

> Hi All,
> 
> This is a story all about how the peeling for gaps introduces a bug in the
> upper bounds.
> 
> Before I go further, I'll first explain how I understand this to work for
> loops with a single exit.
> 
> When peeling for gaps we peel N < VF iterations to scalar.
> This happens by removing N iterations from the calculation of niters such that
> vect_iters * VF == niters is always false.
> 
> In other words, when we exit the vector loop we always fall through to the
> scalar loop.  The loop bounds adjustment guarantees this.  Because of this we
> potentially execute one vector loop iteration less.  That is, if you're at the
> boundary condition where niters % VF == 0, then by peeling one or more scalar
> iterations the vector loop executes one less iteration.
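> 
> For example (illustrative numbers): with niters = 12 and VF = 4, the vector
> loop would normally run exactly 12/4 = 3 iterations.  Peeling one iteration
> for gaps computes the vector iteration count from 11 instead, giving 11/4 = 2
> vector iterations and leaving 4 iterations for the scalar loop, so
> vect_iters * VF (8) never equals niters (12).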
> 
> This is accounted for by the adjustments in vect_transform_loop.  This
> adjustment happens differently based on whether the vector loop can be
> partial or not:
> 
> Peeling for gaps sets the bias to 0 and then:
> 
> when not partial:  we take the floor of (scalar_upper_bound / VF) - 1 to get
>                    the vector latch iteration count.
> 
> when loop is partial:  For a single exit this means the loop is masked; we
>                        take the ceil to account for the fact that the loop
>                        can handle the final partial iteration using masking.
> 
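> As a rough sketch of that adjustment (the names are illustrative; this is not
> the exact code in vect_transform_loop):
> 
>   /* Derive the vector loop's latch bound from the scalar latch bound.  */
>   unsigned
>   vector_latch_bound (unsigned scalar_latch_bound, unsigned vf,
>                       int peeling_for_gaps, int may_be_partial)
>   {
>     /* +1 converts a latch count to an iteration count; peeling for gaps
>        sets the bias to 0 because one scalar iteration always remains.  */
>     unsigned bias = peeling_for_gaps ? 0 : 1;
>     unsigned iters = scalar_latch_bound + bias;
>     /* Ceil when the final iteration may be partial (masked), floor
>        otherwise; -1 converts back to a vector latch count.  */
>     unsigned vect_iters = may_be_partial ? (iters + vf - 1) / vf : iters / vf;
>     return vect_iters - 1;
>   }
> 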
> Note that there's no difference between ceil and floor on the boundary
> condition.  There is a difference however when you're slightly above it,
> i.e. if the scalar loop iterates 14 times, VF = 4, and we peel 1 iteration
> for gaps.
> 
> The partial loop does ((13 + 0) / 4) - 1 == 2 vector iterations, and in effect
> the partial iteration is ignored and done as scalar.
> 
> This is fine because the niters modification has capped the vector iteration
> count at 2.  So when we reduce the induction values you end up entering the
> scalar code with ind_var.2 = ind_var.1 + 2 * VF.
> 
> Now let's look at early breaks.  To make it easier I'll focus on this specific
> testcase:
> 
> char buffer[64];
> 
> __attribute__ ((noipa))
> buff_t *copy (buff_t *first, buff_t *last)
> {
>   char *buffer_ptr = buffer;
>   char *const buffer_end = &buffer[SZ-1];
>   int store_size = sizeof(first->Val);
>   while (first != last && (buffer_ptr + store_size) <= buffer_end)
>     {
>       const char *value_data = (const char *)(&first->Val);
>       __builtin_memcpy(buffer_ptr, value_data, store_size);
>       buffer_ptr += store_size;
>       ++first;
>     }
> 
>   if (first == last)
>     return 0;
> 
>   return first;
> }
> 
> Here the first (early) exit is on the condition:
> 
>   (buffer_ptr + store_size) <= buffer_end
> 
> and the main exit is on condition:
> 
>   first != last
> 
> This is important, as this bug only manifests itself when the first exit has a
> known constant iteration count that's lower than the latch exit count.
> 
> Because buffer holds 64 bytes, and VF = 4, unroll = 2, we end up processing 16
> bytes per iteration.  So the exit has a known bound of 8 + 1.
> 
> The vectorizer correctly analyzes this:
> 
> Statement (exit)if (ivtmp_21 != 0)
>  is executed at most 8 (bounded by 8) + 1 times in loop 1.
> 
> and as a consequence the IV is bounded by 9:
> 
>   # vect_vec_iv_.14_117 = PHI <_118(9), { 9, 8, 7, 6 }(20)>
>   ...
>   vect_ivtmp_21.16_124 = vect_vec_iv_.14_117 + { 18446744073709551615, 
> 18446744073709551615, 18446744073709551615, 18446744073709551615 };
>   mask_patt_22.17_126 = vect_ivtmp_21.16_124 != { 0, 0, 0, 0 };
>   if (mask_patt_22.17_126 == { -1, -1, -1, -1 })
>     goto <bb 3>; [88.89%]
>   else
>     goto <bb 30>; [11.11%]
> 
> The important bits are these:
> 
> In this example the value of last - first = 416.
> 
> The calculated vector iteration count is:
> 
>     x = (((ptr2 - ptr1) - 16) / 16) + 1 = 27
> 
> The bound generated, adjusting for gaps, is:
> 
>    x == (((x - 1) >> 2) << 2)
> 
> which means we'll always fall through to the scalar code, as intended.
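> With these numbers (((27 - 1) >> 2) << 2) == 24, which is strictly less than
> 27, so the vector code never accounts for all 27 iterations and the scalar
> loop always runs.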
> 
> Here are two key things to note:
> 
> 1. In this loop, the early exit will always be the one taken.  When it's taken
>    we enter the scalar loop with the correct induction value to apply the gap
>    peeling.
> 
> 2. If the main exit is taken, the induction values assume you've finished all
>    vector iterations, i.e. that you have completed 24 iterations, as we treat
>    the main exit the same for normal loop vect and early break when not PEELED.
>    This means the induction value is adjusted to ind_var.2 = ind_var.1 + 24 * VF;
> 
> So what's going wrong?  The vectorizer's codegen is correct and efficient;
> however, when we adjust the upper bounds, that code knows that the loop's upper
> bound is based on the early exit, i.e. 8 latch iterations.  In other words, it
> thinks the loop iterates once.
> 
> This is incorrect as the vector loop iterates twice, since it has set up the
> induction value such that it exits at the early exit.  So it in effect iterates
> 2.5 times.
> 
> Because the upper bound is incorrect, when we unroll it now exits from the main
> exit, which uses the incorrect induction value.
> 
> So there are three ways to fix this:
> 
> 1.  If we take the position that the main exit should support both premature
>     exits and final exits then vect_update_ivs_after_vectorizer needs to be
>     skipped for this case, and vectorizable_induction updated with a third case
>     where we reduce with a LAST reduction based on the IVs instead of assuming
>     you're at the end of the vector loop.
> 
>     I don't like this approach.  I don't think we should add a third induction
>     style to cover up an issue introduced by unrolling.  It makes the code
>     harder to follow and makes main exits harder to reason about.
> 
> 2. We could say that vec_init_loop_exit_info should pick the exit which has the
>    smallest known iteration count.  This would turn this case into a PEELED case
>    and the induction values would be correct as we'd always recalculate them
>    from a reduction.  This is suboptimal though as the reason we pick the latch
>    exit as the IV one is to prevent having to rotate the loop.  This results
>    in more efficient code for what we assume is the common case, i.e. the main
>    exit.
> 
> 3. In PR113734 we've established that for vectorization of early breaks we
>    must always treat the loop as partial.  Here partiality means that we have
>    enough vector elements to start the iteration, but we may take an early exit
>    and so never reach the latch/main exit.
> 
>    This requirement is overwritten by the peeling for gaps adjustment of the
>    upper bound.  I believe the bug is simply that this shouldn't be done.
>    The adjustment here is to indicate that the main exit always leads to the
>    scalar loop when peeling for gaps.
> 
>    But this invariant is already always true for all early exits.  Remember that
>    early exits restart the scalar loop at the start of the vector iteration, so
>    the induction values will start it where we want to do the gaps peeling.
> 
> I think option 3 is the correct fix, and also one that doesn't degrade code
> quality.
> 
> Note: I used memcpy and memcmp in the testcase.  I'm not sure if I can rely on
>       these being inlined, but I also don't know how to test for library
>       support.
> 
> Bootstrapped and regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu
> with no issues.
> 
> Ok for master?

I agree with the patch outcome.  I think the reasoning is simpler:
because peeling for gaps applies to the scalar iteration IV, but
with multiple exits an upper bound can effectively apply to a
different exit and thus a different "IV", we cannot simply adjust
an existing upper bound as if it were a bound for the scalar iteration IV.

So I think we should word it like the following.  Can you pick that
up and also add the other testcase I produced?

OK with those changes.

Thanks,
Richard.

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 3ffcac8c613..c728856079b 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -12152,14 +12152,14 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   bool final_iter_may_be_partial
     = LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
       || LOOP_VINFO_EARLY_BREAKS (loop_vinfo);
-  /* The minimum number of iterations performed by the epilogue.  This
-     is 1 when peeling for gaps because we always need a final scalar
-     iteration.  */
-  int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0;
-  /* +1 to convert latch counts to loop iteration counts,
-     -min_epilogue_iters to remove iterations that cannot be performed
-       by the vector code.  */
-  int bias_for_lowest = 1 - min_epilogue_iters;
+  /* +1 to convert latch counts to loop iteration counts.  */
+  int bias_for_lowest = 1;
+  /* When we are peeling for gaps then we take away one scalar iteration
+     from the vector loop.  Thus we can adjust the upper bound by one
+     scalar iteration.  But only when we know the bound applies to the
+     IV exit test which might not be true when we have multiple exits.  */
+  if (! LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    bias_for_lowest -= LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0;
   int bias_for_assumed = bias_for_lowest;
   int alignment_npeels = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
   if (alignment_npeels && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))


> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>       PR tree-optimization/114403
>       * tree-vect-loop.cc (vect_transform_loop): Adjust upper bounds for when
>       peeling for gaps and early break.
> 
> gcc/testsuite/ChangeLog:
> 
>       PR tree-optimization/114403
>       * gcc.dg/vect/vect-early-break_124-pr114403.c: New test.
> 
> ---
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..ae5e53efc45e7bef89c5a72abd6afa48292668db
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
> @@ -0,0 +1,74 @@
> +/* { dg-add-options vect_early_break } */
> +/* { dg-require-effective-target vect_early_break_hw } */
> +/* { dg-require-effective-target vect_long_long } */
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> +
> +#include "tree-vect.h"
> +
> +typedef unsigned long PV;
> +typedef struct _buff_t {
> +    int foo;
> +    PV Val;
> +} buff_t;
> +
> +#define NUM 9
> +#define SZ NUM * sizeof (PV)
> +char buffer[SZ];
> +
> +__attribute__ ((noipa))
> +buff_t *copy (buff_t *first, buff_t *last)
> +{
> +  char *buffer_ptr = buffer;
> +  char *const buffer_end = &buffer[SZ-1];
> +  int store_size = sizeof(first->Val);
> +  while (first != last && (buffer_ptr + store_size) <= buffer_end)
> +    {
> +      const char *value_data = (const char *)(&first->Val);
> +      __builtin_memcpy(buffer_ptr, value_data, store_size);
> +      buffer_ptr += store_size;
> +      ++first;
> +    }
> +
> +  if (first == last)
> +    return 0;
> +
> +  return first;
> +}
> +
> +int main ()
> +{
> +  /* Copy an ascii buffer.  We need to trigger the loop to exit from
> +     the condition where we have more data to copy but not enough space.
> +     For this test that means that OVL must be > SZ.  */
> +#define OVL NUM*2
> +  char str[OVL]="abcdefghiabcdefgh\0";
> +  buff_t tmp[OVL];
> +
> +#pragma GCC novector
> +  for (int i = 0; i < OVL; i++)
> +    tmp[i].Val = str[i];
> +
> +  buff_t *start = &tmp[0];
> +  buff_t *last = &tmp[OVL-1];
> +  buff_t *res = 0;
> +
> +  /* This copy should exit on the early exit, in which case we know
> +     that start != last as we had more data to copy but the buffer
> +     was full.  */
> +  if (!(res = copy (start, last)))
> +    __builtin_abort ();
> +
> +  /* Check if we have the right reduction value.  */
> +  if (res != &tmp[NUM-1])
> +    __builtin_abort ();
> +
> +  int store_size = sizeof(PV);
> +#pragma GCC novector
> +  for (int i = 0; i < NUM - 1; i+=store_size)
> +    if (0 != __builtin_memcmp (buffer+i, (char*)&tmp[i].Val, store_size))
> +      __builtin_abort ();
> +
> +  return 0;
> +}
> +
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 4375ebdcb493a90fd0501cbb4b07466077b525c3..024a24a305c4727f97eb022247f4dca791c52dfe 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -12144,6 +12144,12 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>       -min_epilogue_iters to remove iterations that cannot be performed
>         by the vector code.  */
>    int bias_for_lowest = 1 - min_epilogue_iters;
> +  /* For an early break we must always assume that the vector loop can be
> +     executed partially.  In this definition a partial iteration means that we
> +     take an exit before the IV exit.  */
> +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +    bias_for_lowest = 1;
> +
>    int bias_for_assumed = bias_for_lowest;
>    int alignment_npeels = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
>    if (alignment_npeels && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
