On Fri, 12 Apr 2024, Tamar Christina wrote:

> Hi All,
>
> This is a story all about how the peeling for gaps introduces a bug in the
> upper bounds.
>
> Before I go further, I'll first explain how I understand this to work for
> loops with a single exit.
>
> When peeling for gaps we peel N < VF iterations to scalar.
> This happens by removing N iterations from the calculation of niters such
> that vect_iters * VF == niters is always false.
>
> In other words, when we exit the vector loop we always fall to the scalar
> loop.  The loop bounds adjustment guarantees this.  Because of this we
> potentially execute one vector loop iteration less.  That is, if you're at
> the boundary condition where niters % VF == 0, then by peeling one or more
> scalar iterations the vector loop executes one iteration less.
>
> This is accounted for by the adjustments in vect_transform_loop.  This
> adjustment happens differently based on whether the vector loop can be
> partial or not:
>
> Peeling for gaps sets the bias to 0 and then:
>
> when not partial: we take the floor of (scalar_upper_bound / VF) - 1 to
>                   get the vector latch iteration count.
>
> when loop is partial: For a single exit this means the loop is masked; we
>                       take the ceil to account for the fact that the loop
>                       can handle the final partial iteration using
>                       masking.
>
> Note that there's no difference between ceil and floor on the boundary
> condition.  There is a difference however when you're slightly above it,
> i.e. if the scalar loop iterates 14 times and VF = 4 and we peel 1
> iteration for gaps.
>
> The partial loop does ((13 + 0) / 4) - 1 == 2 vector iterations, and in
> effect the partial iteration is ignored and it's done as scalar.
>
> This is fine because the niters modification has capped the vector
> iteration count at 2.  So when we reduce the induction values you end up
> entering the scalar code with ind_var.2 = ind_var.1 + 2 * VF.
>
> Now let's look at early breaks.
> To make it easier I'll focus on the specific testcase:
>
> char buffer[64];
>
> __attribute__ ((noipa))
> buff_t *copy (buff_t *first, buff_t *last)
> {
>   char *buffer_ptr = buffer;
>   char *const buffer_end = &buffer[SZ-1];
>   int store_size = sizeof(first->Val);
>   while (first != last && (buffer_ptr + store_size) <= buffer_end)
>     {
>       const char *value_data = (const char *)(&first->Val);
>       __builtin_memcpy(buffer_ptr, value_data, store_size);
>       buffer_ptr += store_size;
>       ++first;
>     }
>
>   if (first == last)
>     return 0;
>
>   return first;
> }
>
> Here the first, early exit is on the condition:
>
>   (buffer_ptr + store_size) <= buffer_end
>
> and the main exit is on the condition:
>
>   first != last
>
> This is important, as this bug only manifests itself when the first exit
> has a known constant iteration count that's lower than the latch exit
> count.
>
> Because buffer holds 64 bytes, and VF = 4, unroll = 2, we end up
> processing 16 bytes per iteration.  So the exit has a known bound of
> 8 + 1.
>
> The vectorizer correctly analyzes this:
>
>   Statement (exit)if (ivtmp_21 != 0)
>    is executed at most 8 (bounded by 8) + 1 times in loop 1.
>
> and as a consequence the IV is bounded by 9:
>
>   # vect_vec_iv_.14_117 = PHI <_118(9), { 9, 8, 7, 6 }(20)>
>   ...
>   vect_ivtmp_21.16_124 = vect_vec_iv_.14_117
>     + { 18446744073709551615, 18446744073709551615,
>         18446744073709551615, 18446744073709551615 };
>   mask_patt_22.17_126 = vect_ivtmp_21.16_124 != { 0, 0, 0, 0 };
>   if (mask_patt_22.17_126 == { -1, -1, -1, -1 })
>     goto <bb 3>; [88.89%]
>   else
>     goto <bb 30>; [11.11%]
>
> The important bits are these:
>
> In this example the value of last - first = 416.
>
> The calculated vector iteration count is:
>
>   x = (((ptr2 - ptr1) - 16) / 16) + 1 = 27
>
> The bounds generated, adjusting for gaps:
>
>   x == (((x - 1) >> 2) << 2)
>
> which means we'll always fall through to the scalar code, as intended.
>
> Here are two key things to note:
>
> 1.
> In this loop, the early exit will always be the one taken.  When it's
>    taken we enter the scalar loop with the correct induction value to
>    apply the gap peeling.
>
> 2. If the main exit is taken, the induction values assume you've finished
>    all vector iterations, i.e. they assume you have completed 24
>    iterations, as we treat the main exit the same for normal loop
>    vectorization and early break when not PEELED.  This means the
>    induction value is adjusted to ind_var.2 = ind_var.1 + 24 * VF.
>
> So what's going wrong?  The vectorizer's codegen is correct and
> efficient; however when we adjust the upper bounds, that code assumes the
> loop's upper bound is based on the early exit, i.e. 8 latch iterations,
> or in other words it thinks the loop iterates once.
>
> This is incorrect, as the vector loop iterates twice: it has set up the
> induction value such that it exits at the early exit.  So in effect it
> iterates 2.5x.
>
> Because the upper bound is incorrect, when we unroll, the loop now exits
> from the main exit, which uses the incorrect induction value.
>
> So there are three ways to fix this:
>
> 1. If we take the position that the main exit should support both
>    premature exits and final exits, then vect_update_ivs_after_vectorizer
>    needs to be skipped for this case, and vectorizable_induction updated
>    with a third case where we reduce with a LAST reduction based on the
>    IVs instead of assuming you're at the end of the vector loop.
>
>    I don't like this approach.  I don't think we should add a third
>    induction style to cover up an issue introduced by unrolling.  It
>    makes the code harder to follow and makes main exits harder to reason
>    about.
>
> 2. We could say that vec_init_loop_exit_info should pick the exit which
>    has the smallest known iteration count.  This would turn this case
>    into a PEELED case and the induction values would be correct as we'd
>    always recalculate them from a reduction.
>    This is suboptimal though, as the reason we pick the latch exit as the
>    IV one is to prevent having to rotate the loop.  This results in more
>    efficient code for what we assume is the common case, i.e. the main
>    exit.
>
> 3. In PR113734 we've established that for vectorization of early breaks
>    we must always treat the loop as partial.  Here partiality means that
>    we have enough vector elements to start the iteration, but we may take
>    an early exit and so never reach the latch/main exit.
>
>    This requirement is overwritten by the peeling-for-gaps adjustment of
>    the upper bound.  I believe the bug is simply that this shouldn't be
>    done.  The adjustment here is to indicate that the main exit always
>    leads to the scalar loop when peeling for gaps.
>
>    But this invariant is already always true for all early exits.
>    Remember that early exits restart the scalar loop at the start of the
>    vector iteration, so the induction values will start it where we want
>    to do the gaps peeling.
>
> I think #3 is the correct fix, and also one that doesn't degrade code
> quality.
>
> Note: I used memcpy and memcmp in the testcase; I'm not sure if I can
>       rely on these being inlined, but I also don't know how to test for
>       library support.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu and
> x86_64-pc-linux-gnu with no issues.
>
> Ok for master?
I agree with the patch outcome.  I think the reasoning is simpler: because
peeling for gaps applies to the scalar iteration IV, but with multiple
exits an upper bound can effectively apply to a different exit and thus a
different "IV", we cannot simply adjust an existing upper bound as if it
were a bound for the scalar iteration IV.  So I think we should word it
like the following.  Can you pick that up and also add the other testcase
I produced?

OK with those changes.

Thanks,
Richard.

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 3ffcac8c613..c728856079b 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -12152,14 +12152,14 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   bool final_iter_may_be_partial
     = LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
       || LOOP_VINFO_EARLY_BREAKS (loop_vinfo);
-  /* The minimum number of iterations performed by the epilogue.  This
-     is 1 when peeling for gaps because we always need a final scalar
-     iteration.  */
-  int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0;
-  /* +1 to convert latch counts to loop iteration counts,
-     -min_epilogue_iters to remove iterations that cannot be performed
-     by the vector code.  */
-  int bias_for_lowest = 1 - min_epilogue_iters;
+  /* +1 to convert latch counts to loop iteration counts.  */
+  int bias_for_lowest = 1;
+  /* When we are peeling for gaps then we take away one scalar iteration
+     from the vector loop.  Thus we can adjust the upper bound by one
+     scalar iteration.  But only when we know the bound applies to the
+     IV exit test which might not be true when we have multiple exits.  */
+  if (! LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    bias_for_lowest -= LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ?
1 : 0;
   int bias_for_assumed = bias_for_lowest;
   int alignment_npeels = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
   if (alignment_npeels && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))

> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	PR tree-optimization/114403
> 	* tree-vect-loop.cc (vect_transform_loop): Adjust upper bounds for
> 	when peeling for gaps and early break.
>
> gcc/testsuite/ChangeLog:
>
> 	PR tree-optimization/114403
> 	* gcc.dg/vect/vect-early-break_124-pr114403.c: New test.
>
> ---
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..ae5e53efc45e7bef89c5a72abd6afa48292668db
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
> @@ -0,0 +1,74 @@
> +/* { dg-add-options vect_early_break } */
> +/* { dg-require-effective-target vect_early_break_hw } */
> +/* { dg-require-effective-target vect_long_long } */
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> +
> +#include "tree-vect.h"
> +
> +typedef unsigned long PV;
> +typedef struct _buff_t {
> +    int foo;
> +    PV Val;
> +} buff_t;
> +
> +#define NUM 9
> +#define SZ NUM * sizeof (PV)
> +char buffer[SZ];
> +
> +__attribute__ ((noipa))
> +buff_t *copy (buff_t *first, buff_t *last)
> +{
> +  char *buffer_ptr = buffer;
> +  char *const buffer_end = &buffer[SZ-1];
> +  int store_size = sizeof(first->Val);
> +  while (first != last && (buffer_ptr + store_size) <= buffer_end)
> +    {
> +      const char *value_data = (const char *)(&first->Val);
> +      __builtin_memcpy(buffer_ptr, value_data, store_size);
> +      buffer_ptr += store_size;
> +      ++first;
> +    }
> +
> +  if (first == last)
> +    return 0;
> +
> +  return first;
> +}
> +
> +int main ()
> +{
> +  /* Copy an ascii buffer.  We need to trigger the loop to exit from
> +     the condition where we have more data to copy but not enough space.
> +     For this test that means that OVL must be > SZ.  */
> +#define OVL NUM*2
> +  char str[OVL]="abcdefghiabcdefgh\0";
> +  buff_t tmp[OVL];
> +
> +#pragma GCC novector
> +  for (int i = 0; i < OVL; i++)
> +    tmp[i].Val = str[i];
> +
> +  buff_t *start = &tmp[0];
> +  buff_t *last = &tmp[OVL-1];
> +  buff_t *res = 0;
> +
> +  /* This copy should exit on the early exit, in which case we know
> +     that start != last as we had more data to copy but the buffer
> +     was full.  */
> +  if (!(res = copy (start, last)))
> +    __builtin_abort ();
> +
> +  /* Check if we have the right reduction value.  */
> +  if (res != &tmp[NUM-1])
> +    __builtin_abort ();
> +
> +  int store_size = sizeof(PV);
> +#pragma GCC novector
> +  for (int i = 0; i < NUM - 1; i+=store_size)
> +    if (0 != __builtin_memcmp (buffer+i, (char*)&tmp[i].Val, store_size))
> +      __builtin_abort ();
> +
> +  return 0;
> +}
> +
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 4375ebdcb493a90fd0501cbb4b07466077b525c3..024a24a305c4727f97eb022247f4dca791c52dfe 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -12144,6 +12144,12 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
> 	 -min_epilogue_iters to remove iterations that cannot be performed
> 	 by the vector code.  */
>    int bias_for_lowest = 1 - min_epilogue_iters;
> +  /* For an early break we must always assume that the vector loop can be
> +     executed partially.  In this definition a partial iteration means
> +     that we take an exit before the IV exit.
> */
> +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +    bias_for_lowest = 1;
> +
>    int bias_for_assumed = bias_for_lowest;
>    int alignment_npeels = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
>    if (alignment_npeels && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)