Richard Biener <rguent...@suse.de> writes: > On Fri, 12 Apr 2024, Tamar Christina wrote: > >> Hi All, >> >> This is a story all about how the peeling for gaps introduces a bug in the >> upper >> bounds. >> >> Before I go further, I'll first explain how I understand this to work for >> loops >> with a single exit. >> >> When peeling for gaps we peel N < VF iterations to scalar. >> This happens by removing N iterations from the calculation of niters such >> that >> vect_iters * VF == niters is always false. >> >> In other words, when we exit the vector loop we always fall to the scalar >> loop. >> The loop bounds adjustment guarantees this. Because of this we potentially >> execute a vector loop iteration less. That is, if you're at the boundary >> condition where niters % VF by peeling one or more scalar iterations the >> vector >> loop executes one less. >> >> This is accounted for by the adjustments in vect_transform_loops. This >> adjustment happens differently based on whether the the vector loop can be >> partial or not: >> >> Peeling for gaps sets the bias to 0 and then: >> >> when not partial: we take the floor of (scalar_upper_bound / VF) - 1 to get >> the >> vector latch iteration count. >> >> when loop is partial: For a single exit this means the loop is masked, we >> take >> the ceil to account for the fact that the loop can >> handle >> the final partial iteration using masking. >> >> Note that there's no difference between ceil an floor on the boundary >> condition. >> There is a difference however when you're slightly above it. i.e. if scalar >> iterates 14 times and VF = 4 and we peel 1 iteration for gaps. >> >> The partial loop does ((13 + 0) / 4) - 1 == 2 vector iterations. and in >> effect >> the partial iteration is ignored and it's done as scalar. >> >> This is fine because the niters modification has capped the vector iteration >> at >> 2. So that when we reduce the induction values you end up entering the >> scalar >> code with ind_var.2 = ind_var.1 + 2 * VF. >> >> Now lets look at early breaks. To make it esier I'll focus on the specific >> testcase: >> >> char buffer[64]; >> >> __attribute__ ((noipa)) >> buff_t *copy (buff_t *first, buff_t *last) >> { >> char *buffer_ptr = buffer; >> char *const buffer_end = &buffer[SZ-1]; >> int store_size = sizeof(first->Val); >> while (first != last && (buffer_ptr + store_size) <= buffer_end) >> { >> const char *value_data = (const char *)(&first->Val); >> __builtin_memcpy(buffer_ptr, value_data, store_size); >> buffer_ptr += store_size; >> ++first; >> } >> >> if (first == last) >> return 0; >> >> return first; >> } >> >> Here the first, early exit is on the condition: >> >> (buffer_ptr + store_size) <= buffer_end >> >> and the main exit is on condition: >> >> first != last >> >> This is important, as this bug only manifests itself when the first exit has >> a >> known constant iteration count that's lower than the latch exit count. >> >> because buffer holds 64 bytes, and VF = 4, unroll = 2, we end up processing >> 16 >> bytes per iteration. So the exit has a known bounds of 8 + 1. >> >> The vectorizer correctly analizes this: >> >> Statement (exit)if (ivtmp_21 != 0) >> is executed at most 8 (bounded by 8) + 1 times in loop 1. >> >> and as a consequence the IV is bound by 9: >> >> # vect_vec_iv_.14_117 = PHI <_118(9), { 9, 8, 7, 6 }(20)> >> ... >> vect_ivtmp_21.16_124 = vect_vec_iv_.14_117 + { 18446744073709551615, >> 18446744073709551615, 18446744073709551615, 18446744073709551615 }; >> mask_patt_22.17_126 = vect_ivtmp_21.16_124 != { 0, 0, 0, 0 }; >> if (mask_patt_22.17_126 == { -1, -1, -1, -1 }) >> goto <bb 3>; [88.89%] >> else >> goto <bb 30>; [11.11%] >> >> The imporant bits are this: >> >> In this example the value of last - first = 416. >> >> the calculated vector iteration count, is: >> >> x = (((ptr2 - ptr1) - 16) / 16) + 1 = 27 >> >> the bounds generated, adjusting for gaps: >> >> x == (((x - 1) >> 2) << 2) >> >> which means we'll always fall through to the scalar code. as intended. >> >> Here are two key things to note: >> >> 1. In this loop, the early exit will always be the one taken. When it's >> taken >> we enter the scalar loop with the correct induction value to apply the gap >> peeling. >> >> 2. If the main exit is taken, the induction values assumes you've finished >> all >> vector iterations. i.e. it assumes you have completed 24 iterations, as >> we >> treat the main exit the same for normal loop vect and early break when not >> PEELED. >> This means the induction value is adjusted to ind_var.2 = ind_var.1 + 24 >> * VF; >> >> So what's going wrong. The vectorizer's codegen is correct and efficient, >> however when we adjust the upper bounds, that code knows that the loops upper >> bound is based on the early exit. i.e. 8 latch iterations. or in other words. >> It thinks the loop iterates once. >> >> This is incorrect as the vector loop iterates twice, as it has set up the >> induction value such that it exits at the early exit. So it in effect >> iterates >> 2.5x times. >> >> Becuase the upper bound is incorrect, when we unroll it now exits from the >> main >> exit which uses the incorrect induction value. >> >> So there are three ways to fix this: >> >> 1. If we take the position that the main exit should support both premature >> exits and final exits then vect_update_ivs_after_vectorizer needs to be >> skipped for this case, and vectorizable_induction updated with third >> case >> where we reduce with LAST reduction based on the IVs instead of assuming >> you're at the end of the vector loop. >> >> I don't like this approach. It don't think we should add a third >> induction >> style to cover up an issue introduced by unrolling. It makes the code >> harder to follow and makes main exits harder to reason about. >> >> 2. We could say that vec_init_loop_exit_info should pick the exit which has >> the >> smallest known iteration count. This would turn this case into a PEELED >> case >> and the induction values would be correct as we'd always recalculate them >> from a reduction. This is suboptimal though as the reason we pick the >> latch >> exit as the IV one is to prevent having to rotate the loop. This results >> in more efficient code for what we assume is the common case, i.e. the >> main >> exit. >> >> 3. In PR113734 we've established that for vectorization of early breaks that >> we >> must always treat the loop as partial. Here partiallity means that we >> have >> enough vector elements to start the iteration, but we may take an early >> exit >> and so never reach the latch/main exit. >> >> This requirement is overwritten by the peeling for gaps adjustment of the >> upper bound. I believe the bug is simply that this shouldn't be done. >> The adjustment here is to indicate that the main exit always leads to the >> scalar loop when peeling for gaps. >> >> But this invariant is already always true for all early exits. Remember >> that >> early exits restart the scalar loop at the start of the vector iteration, >> so >> the induction values will start it where we want to do the gaps peeling. >> >> I think no# 3 is the correct fix, and also one that doesn't degrade code >> quality. >> >> Note: I used memcpy and memcmp in the testcase, I'm not sure if I can rely on >> these being inlined? but I also don't know how to test for library >> support. >> >> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and >> no issues. >> >> Ok for master? > > I agree with the patch outcome. I think the reasoning is simpler > because peeling for gaps applies to the scalar iteration IV but > with multiple exits an upper bound can effectively apply to a > different exit and thus a different "IV" we cannot simply adjust > an existing upper bound as if it were a bound for the scalar iteration IV. > > So I think we should word it like the following. Can you pick that > up and also add the other testcase I produced? > > OK with those changes. > > Thanks, > Richard. > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index 3ffcac8c613..c728856079b 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@ -12152,14 +12152,14 @@ vect_transform_loop (loop_vec_info loop_vinfo, > gimple > *loop_vectorized_call) > bool final_iter_may_be_partial > = LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) > || LOOP_VINFO_EARLY_BREAKS (loop_vinfo); > - /* The minimum number of iterations performed by the epilogue. This > - is 1 when peeling for gaps because we always need a final scalar > - iteration. */ > - int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : > 0; > - /* +1 to convert latch counts to loop iteration counts, > - -min_epilogue_iters to remove iterations that cannot be performed > - by the vector code. */ > - int bias_for_lowest = 1 - min_epilogue_iters; > + /* +1 to convert latch counts to loop iteration counts. */ > + int bias_for_lowest = 1; > + /* When we are peeling for gaps then we take away one scalar iteration > + from the vector loop. Thus we can adjust the upper bound by one > + scalar iteration. But only when we know the bound applies to the > + IV exit test which might not be true when we have multiple exits. > */ > + if (! LOOP_VINFO_EARLY_BREAKS (loop_vinfo)) > + bias_for_lowest -= LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0; > int bias_for_assumed = bias_for_lowest; > int alignment_npeels = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo); > if (alignment_npeels && LOOP_VINFO_USING_PARTIAL_VECTORS_P > (loop_vinfo))
Ah, nice. As the person guilty of writing the current form of the code, I think the spltting out of min_epilogue_iters was anticipating some future feature (that never materialised) in which the value might not be binary. I agree that it's no longer worth keeping it separate after this adjustment, since it's not clear how far it would generalise to values greater than 1. And yeah, in the irc discussion, I hadn't realised that the "early" and "IV" exits had been rotated. Thanks, Richard