On Wed, Jul 24, 2024 at 1:31 AM Edwin Lu <e...@rivosinc.com> wrote: > > > On 7/23/2024 11:20 AM, Richard Sandiford wrote: > > Edwin Lu <e...@rivosinc.com> writes: > >> On 7/23/2024 4:56 AM, Richard Biener wrote: > >>> On Tue, Jul 23, 2024 at 1:03 AM Edwin Lu <e...@rivosinc.com> wrote: > >>>> Hi Richard, > >>>> > >>>> On 5/31/2024 1:48 AM, Richard Biener wrote: > >>>>> On Thu, May 30, 2024 at 2:11 AM Patrick O'Neill <patr...@rivosinc.com> > >>>>> wrote: > >>>>>> From: Greg McGary <g...@rivosinc.com> > >>>>> Still a NACK. If remain ends up zero then > >>>>> > >>>>> /* Try to use a single smaller load when we are > >>>>> about > >>>>> to load excess elements compared to the > >>>>> unrolled > >>>>> scalar loop. */ > >>>>> if (known_gt ((vec_num * j + i + 1) * nunits, > >>>>> (group_size * vf - gap))) > >>>>> { > >>>>> poly_uint64 remain = ((group_size * vf - gap) > >>>>> - (vec_num * j + i) * > >>>>> nunits); > >>>>> if (known_ge ((vec_num * j + i + 1) * nunits > >>>>> - (group_size * vf - gap), > >>>>> nunits)) > >>>>> /* DR will be unused. */ > >>>>> ltype = NULL_TREE; > >>>>> > >>>>> needs to be re-formulated so that the combined conditions make sure > >>>>> this doesn't happen. The outer known_gt should already ensure that > >>>>> remain > 0. For correctness that should possibly be maybe_gt though. > > Yeah. FWIW, I mentioned the maybe_gt thing in > > https://gcc.gnu.org/pipermail/gcc-patches/2024-May/653013.html: > > > > Pre-existing, but shouldn't this be maybe_gt rather than known_gt? > > We can only skip doing it if we know for sure that the load won't cross > > the gap. (Not sure whether the difference can trigger in practice.) > > > > But AFAICT, the known_gt doesn't inherently prove that remain is known > > to be nonzero. It just proves that the gap between the end of the scalar > > accesses and the end of this vector is known to be nonzero. > > > >>> Putting the list back in the loop and CCing Richard S. > >>> > >>>> I'm currently looking into this patch and am trying to figure out what > >>>> is going on. Stepping through gdb, I see that remain == {coeffs = {0, > >>>> 2}} and nunits == {coeffs = {2, 2}} (the outer known_gt returned true > >>>> with known_gt({coeffs = {8, 8}}, {coeffs = {6, 8}})). > >>>> > >>>> From what I understand, this falls under the umbrella of 0 <= remain < > >>>> nunits. The divide by zero error is because of the 0 <= remain which is > >>>> coming from the constant_multiple_p function in poly-int.h where it > >>>> performs the modulus NCa(a.coeffs[0]) % NCb(b.coeffs[0]). > >>>> (https://github.com/gcc-mirror/gcc/blob/master/gcc/poly-int.h#L1970-L1971) > >>>> > >>>> > >>>> > if (known_ge ((vec_num * j + i + 1) * > >>>> nunits > >>>> > - (group_size * vf - gap), > >>>> nunits)) > >>>> > /* DR will be unused. */ > >>>> > ltype = NULL_TREE; > >>>> > >>>> This if condition is a bit suspicious to me though. I'm seeing that it's > >>>> evaluating known_ge({coeffs = {2, 0}}, {coeffs = {2, 2}}) which is > >>>> returning false. Should it be maybe_ge instead? > >>> No, we can only not emit a load if we know it won't be used, not if > >>> it eventually cannot be used. > > Agreed. > > > > [switching round for easier reply] > >>>> After running some > >>>> tests, to me it looks like it doesn't vectorize quite as often; however, > >>>> I'm not fully sure what else to do when the coeffs can potentially be > >>>> equal to 0. > >>>> > >>>> Should it even be possible for there to be a {coeffs = {0, n}} > >>>> situation? My understanding of how poly_ints are used for representing > >>>> vectorization is that the first coefficient is the number of elements > >>>> needed to make the minimum supported vector size. That is, if vector > >>>> lengths are 128 bits, element size is 32 bits, coeff[0] should be > >>>> minimum of 4. Is this understanding correct? > >>> I was told n can be negative, but nunits.coeff[0] should be non-zero. > >> What would it mean for the coeffs[0] to be 0? Would that mean the vector > >> length supports 0 bits? > > coeffs = {A,B} just means A+B*X, where X is the number of vector > > "chunks" beyond the minimum length. It's certainly valid for a poly_int > > to have a zero coeffs[0] (i.e. zero A). For example, (the length of a > > vector) - (the minimum length) would have this property. > > Thanks for the explanation! I have a few clarification questions about this. > If I understand correctly, B would represent the number of elements the > vector can have (for 128b vector operating on 32b elements, B == 4, but if > operating on 64b elements B == 2); however, I'm not too sure what A > represents. > > On the poly_int docs, it says > > An indeterminate value of 0 should usually represent the minimum possible > > runtime value, with c0 specifying the value in that case. > "minimum possible runtime value" doesn't make sense to me. Does it mean the > potential minimum bound of elements it will operate on? > > >>> What is j and i when the divisor is zero? > >> The values I see in gdb are: vec_num = 4 j = 0 i = 3 vf = {coeffs = {2, > >> 2}} nunits = {coeffs = {2, 2}} group_size = 4 gap = 2 vect_align = 2 > >> remain = {coeffs = {0, 2}} > > OK, so let's use D to mean "data" and G to mean "gap". Then, for the > > minimum vector length of 2 elements, we have: > > > > DD GG DD GG > > > > The last load will read beyond the scalar loop if the vector loop happens > > to handle all elements of the scalar loop. > > > > For a vector length of 4 elements, we have: > > > > DDGG DDGG DDGG DDGG > > > > where every load contains both data and gaps. The same will be true > > for larger vectors. > > > > That's where remain={0,2} is coming from. The last load is fully redundant > > for the minimum VL but not for larger VL. > > I think I understand the example but just want to see if I fully understand. > If we have 7 data elements with minimum vector length of 2 elements, the > vector loop would essentially run 3 times on > > DD GG DD GG DD GG > > There is an extra Data element which hasn't been processed. Since there is > one element left over and is less than the minimum vector length, we can > either > 1. run scalar execution of the number of remaining elements (finishing scalar > loop) or > 2. see if a smaller vector size can process it nicely (i.e. minimum vector > length of 1 element in this case. remain now is {1, 1}) or > 3. use the same vector size but mask out the remaining element > > Is this a correct? > > > Based on that, the patch below looks correct to me, but I might have > > misunderstood the intent. > > As an alternative to the original patch, would this also make sense? > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > index aab3aa59962..cd657ac63af 100644 > --- a/gcc/tree-vect-stmts.cc > +++ b/gcc/tree-vect-stmts.cc > @@ -11479,7 +11479,7 @@ vectorizable_load (vec_info *vinfo, > /* Try to use a single smaller load when we are about > to load excess elements compared to the unrolled > scalar loop. */ > - if (known_gt ((vec_num * j + i + 1) * nunits, > + if (maybe_gt ((vec_num * j + i + 1) * nunits, > (group_size * vf - gap))) > { > poly_uint64 remain = ((group_size * vf - gap)
That's a good point - this should indeed be maybe_gt > @@ -11502,6 +11502,10 @@ vectorizable_load (vec_info *vinfo, > /* Aligned access to the gap area when there's > at least one element in it is OK. */ > ; > + else if (maybe_eq (remain, 0)) > + /* Handle remain.coeffs[0] == 0 case. Number of > + elements is an exact multiple of vector length. > */ > + ; Hmm, but here it should be known_eq? And why doesn't constant_multiple_p work then? I think remain.coeffs[1] is never negative here, but that would be another way to arrive at zero. So in principle this would amend if (known_ge ((vec_num * j + i + 1) * nunits - (group_size * vf - gap), nunits)) /* DR will be unused. */ ltype = NULL_TREE; ? Sure the original patch would work, but then the corresponding change in get_group_loadstore_type needs to be done to ensure if remain maybe zero we're forcing peeling for gaps, no? > else > { > /* remain should now be > 0 and < nunits. */ > > > Thanks, > > Richard > > Edwin > > >>>>>> gcc/ChangeLog: > >>>>>> * gcc/tree-vect-stmts.cc (gcc/tree-vect-stmts.cc): Prevent > >>>>>> divide-by-zero. > >>>>>> * testsuite/gcc.target/riscv/rvv/autovec/no-segment.c: > >>>>>> Remove dg-ice. > >>>>>> --- > >>>>>> No changes in v3. Depends on the risc-v backend option added in patch > >>>>>> 1 to > >>>>>> trigger the ICE. > >>>>>> --- > >>>>>> gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c | 1 - > >>>>>> gcc/tree-vect-stmts.cc | 3 ++- > >>>>>> 2 files changed, 2 insertions(+), 2 deletions(-) > >>>>>> > >>>>>> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c > >>>>>> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c > >>>>>> index dfbe09f01a1..79d03612a22 100644 > >>>>>> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c > >>>>>> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c > >>>>>> @@ -1,6 +1,5 @@ > >>>>>> /* { dg-do compile } */ > >>>>>> /* { dg-options "-march=rv64gcv -mabi=lp64d > >>>>>> -mrvv-vector-bits=scalable -O3 -mno-autovec-segment" } */ > >>>>>> -/* { dg-ice "Floating point exception" } */ > >>>>>> > >>>>>> enum e { c, d }; > >>>>>> enum g { f }; > >>>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > >>>>>> index 4219ad832db..34f5736ba00 100644 > >>>>>> --- a/gcc/tree-vect-stmts.cc > >>>>>> +++ b/gcc/tree-vect-stmts.cc > >>>>>> @@ -11558,7 +11558,8 @@ vectorizable_load (vec_info *vinfo, > >>>>>> - (vec_num * j + i) * nunits); > >>>>>> /* remain should now be > 0 and < > >>>>>> nunits. */ > >>>>>> unsigned num; > >>>>>> - if (constant_multiple_p (nunits, remain, > >>>>>> &num)) > >>>>>> + if (known_gt (remain, 0) > >>>>>> + && constant_multiple_p (nunits, > >>>>>> remain, &num)) > >>>>>> { > >>>>>> tree ptype; > >>>>>> new_vtype > > > > > >>>>>> -- > >>>>>> 2.43.2 > >>>>>> > >>>> Edwin > >>>>