On Wed, Jul 24, 2024 at 1:31 AM Edwin Lu <e...@rivosinc.com> wrote:
>
>
> On 7/23/2024 11:20 AM, Richard Sandiford wrote:
> > Edwin Lu <e...@rivosinc.com> writes:
> >> On 7/23/2024 4:56 AM, Richard Biener wrote:
> >>> On Tue, Jul 23, 2024 at 1:03 AM Edwin Lu <e...@rivosinc.com> wrote:
> >>>> Hi Richard,
> >>>>
> >>>> On 5/31/2024 1:48 AM, Richard Biener wrote:
> >>>>> On Thu, May 30, 2024 at 2:11 AM Patrick O'Neill <patr...@rivosinc.com> 
> >>>>> wrote:
> >>>>>> From: Greg McGary <g...@rivosinc.com>
> >>>>> Still a NACK.  If remain ends up zero then
> >>>>>
> >>>>>                        /* Try to use a single smaller load when we are 
> >>>>> about
> >>>>>                           to load excess elements compared to the 
> >>>>> unrolled
> >>>>>                           scalar loop.  */
> >>>>>                        if (known_gt ((vec_num * j + i + 1) * nunits,
> >>>>>                                           (group_size * vf - gap)))
> >>>>>                          {
> >>>>>                            poly_uint64 remain = ((group_size * vf - gap)
> >>>>>                                                  - (vec_num * j + i) * 
> >>>>> nunits);
> >>>>>                            if (known_ge ((vec_num * j + i + 1) * nunits
> >>>>>                                          - (group_size * vf - gap), 
> >>>>> nunits))
> >>>>>                              /* DR will be unused.  */
> >>>>>                              ltype = NULL_TREE;
> >>>>>
> >>>>> needs to be re-formulated so that the combined conditions make sure
> >>>>> this doesn't happen.  The outer known_gt should already ensure that
> >>>>> remain > 0.  For correctness that should possibly be maybe_gt though.
> > Yeah.  FWIW, I mentioned the maybe_gt thing in
> > https://gcc.gnu.org/pipermail/gcc-patches/2024-May/653013.html:
> >
> >    Pre-existing, but shouldn't this be maybe_gt rather than known_gt?
> >    We can only skip doing it if we know for sure that the load won't cross
> >    the gap.  (Not sure whether the difference can trigger in practice.)
> >
> > But AFAICT, the known_gt doesn't inherently prove that remain is known
> > to be nonzero.  It just proves that the gap between the end of the scalar
> > accesses and the end of this vector is known to be nonzero.
> >
> >>> Putting the list back in the loop and CCing Richard S.
> >>>
> >>>> I'm currently looking into this patch and am trying to figure out what
> >>>> is going on. Stepping through gdb, I see that remain == {coeffs = {0,
> >>>> 2}} and nunits == {coeffs = {2, 2}} (the outer known_gt returned true
> >>>> with known_gt({coeffs = {8, 8}}, {coeffs = {6, 8}})).
> >>>>
> >>>>    From what I understand, this falls under the umbrella of 0 <= remain <
> >>>> nunits. The divide by zero error is because of the 0 <= remain which is
> >>>> coming from the constant_multiple_p function in poly-int.h where it
> >>>> performs the modulus NCa(a.coeffs[0]) % NCb(b.coeffs[0]).
> >>>> (https://github.com/gcc-mirror/gcc/blob/master/gcc/poly-int.h#L1970-L1971)
> >>>>
> >>>>
> >>>>    >                          if (known_ge ((vec_num * j + i + 1) * 
> >>>> nunits
> >>>>    >                                        - (group_size * vf - gap),
> >>>> nunits))
> >>>>    >                            /* DR will be unused.  */
> >>>>    >                            ltype = NULL_TREE;
> >>>>
> >>>> This if condition is a bit suspicious to me though. I'm seeing that it's
> >>>> evaluating known_ge({coeffs = {2, 0}}, {coeffs = {2, 2}}) which is
> >>>> returning false. Should it be maybe_ge instead?
> >>> No, we can only not emit a load if we know it won't be used, not if
> >>> it eventually cannot be used.
> > Agreed.
> >
> > [switching round for easier reply]
> >>>> After running some
> >>>> tests, to me it looks like it doesn't vectorize quite as often; however,
> >>>> I'm not fully sure what else to do when the coeffs can potentially be
> >>>> equal to 0.
> >>>>
> >>>> Should it even be possible for there to be a {coeffs = {0, n}}
> >>>> situation? My understanding of how poly_ints are used for representing
> >>>> vectorization is that the first coefficient is the number of elements
> >>>> needed to make the minimum supported vector size. That is, if vector
> >>>> lengths are 128 bits, element size is 32 bits, coeff[0] should be
> >>>> minimum of 4. Is this understanding correct?
> >>> I was told n can be negative, but nunits.coeff[0] should be non-zero.
> >> What would it mean for the coeffs[0] to be 0? Would that mean the vector 
> >> length supports 0 bits?
> > coeffs = {A,B} just means A+B*X, where X is the number of vector
> > "chunks" beyond the minimum length.  It's certainly valid for a poly_int
> > to have a zero coeffs[0] (i.e. zero A).  For example, (the length of a
> > vector) - (the minimum length) would have this property.
>
> Thanks for the explanation! I have a few clarification questions about this.
> If I understand correctly, B would represent the number of elements the 
> vector can have (for 128b vector operating on 32b elements, B == 4, but if 
> operating on 64b elements B == 2); however, I'm not too sure what A 
> represents.
>
> On the poly_int docs, it says
> > An indeterminate value of 0 should usually represent the minimum possible 
> > runtime value, with c0 specifying the value in that case.
> "minimum possible runtime value" doesn't make sense to me. Does it mean the 
> potential minimum bound of elements it will operate on?
>
> >>> What is j and i when the divisor is zero?
> >> The values I see in gdb are: vec_num = 4 j = 0 i = 3 vf = {coeffs = {2,
> >> 2}} nunits = {coeffs = {2, 2}} group_size = 4 gap = 2 vect_align = 2
> >> remain = {coeffs = {0, 2}}
> > OK, so let's use D to mean "data" and G to mean "gap".  Then, for the
> > minimum vector length of 2 elements, we have:
> >
> >    DD GG DD GG
> >
> > The last load will read beyond the scalar loop if the vector loop happens
> > to handle all elements of the scalar loop.
> >
> > For a vector length of 4 elements, we have:
> >
> >    DDGG DDGG DDGG DDGG
> >
> > where every load contains both data and gaps.  The same will be true
> > for larger vectors.
> >
> > That's where remain={0,2} is coming from.  The last load is fully redundant
> > for the minimum VL but not for larger VL.
>
> I think I understand the example but just want to see if I fully understand.
> If we have 7 data elements with minimum vector length of 2 elements, the 
> vector loop would essentially run 3 times on
>
>    DD GG DD GG DD GG
>
> There is an extra Data element which hasn't been processed. Since there is 
> one element left over and is less than the minimum vector length, we can 
> either
> 1. run scalar execution of the number of remaining elements (finishing scalar 
> loop) or
> 2. see if a smaller vector size can process it nicely (i.e. minimum vector 
> length of 1 element in this case. remain now is {1, 1}) or
> 3. use the same vector size but mask out the remaining element
>
> Is this a correct?
>
> > Based on that, the patch below looks correct to me, but I might have
> > misunderstood the intent.
>
> As an alternative to the original patch, would this also make sense?
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index aab3aa59962..cd657ac63af 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -11479,7 +11479,7 @@ vectorizable_load (vec_info *vinfo,
>                      /* Try to use a single smaller load when we are about
>                         to load excess elements compared to the unrolled
>                         scalar loop.  */
> -                   if (known_gt ((vec_num * j + i + 1) * nunits,
> +                   if (maybe_gt ((vec_num * j + i + 1) * nunits,
>                                         (group_size * vf - gap)))
>                        {
>                          poly_uint64 remain = ((group_size * vf - gap)

That's a good point - this should indeed be maybe_gt

> @@ -11502,6 +11502,10 @@ vectorizable_load (vec_info *vinfo,
>                            /* Aligned access to the gap area when there's
>                               at least one element in it is OK.  */
>                            ;
> +                       else if (maybe_eq (remain, 0))
> +                         /* Handle remain.coeffs[0] == 0 case. Number of
> +                            elements is an exact multiple of vector length.  
> */
> +                         ;

Hmm, but here it should be known_eq?  And why doesn't
constant_multiple_p work then?  I think remain.coeffs[1] is never
negative here, but that would be another way to arrive at zero.

So in principle this would amend

                        if (known_ge ((vec_num * j + i + 1) * nunits
                                      - (group_size * vf - gap), nunits))
                          /* DR will be unused.  */
                          ltype = NULL_TREE;

?

Sure the original patch would work, but then the corresponding change
in get_group_loadstore_type needs to be done to ensure if remain
maybe zero we're forcing peeling for gaps, no?

>                          else
>                            {
>                              /* remain should now be > 0 and < nunits.  */
>
> > Thanks,
> > Richard
>
> Edwin
>
> >>>>>> gcc/ChangeLog:
> >>>>>>            * gcc/tree-vect-stmts.cc (gcc/tree-vect-stmts.cc): Prevent 
> >>>>>> divide-by-zero.
> >>>>>>            * testsuite/gcc.target/riscv/rvv/autovec/no-segment.c: 
> >>>>>> Remove dg-ice.
> >>>>>> ---
> >>>>>> No changes in v3. Depends on the risc-v backend option added in patch 
> >>>>>> 1 to
> >>>>>> trigger the ICE.
> >>>>>> ---
> >>>>>>     gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c | 1 -
> >>>>>>     gcc/tree-vect-stmts.cc                                  | 3 ++-
> >>>>>>     2 files changed, 2 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c 
> >>>>>> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c
> >>>>>> index dfbe09f01a1..79d03612a22 100644
> >>>>>> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c
> >>>>>> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c
> >>>>>> @@ -1,6 +1,5 @@
> >>>>>>     /* { dg-do compile } */
> >>>>>>     /* { dg-options "-march=rv64gcv -mabi=lp64d 
> >>>>>> -mrvv-vector-bits=scalable -O3 -mno-autovec-segment" } */
> >>>>>> -/* { dg-ice "Floating point exception" } */
> >>>>>>
> >>>>>>     enum e { c, d };
> >>>>>>     enum g { f };
> >>>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> >>>>>> index 4219ad832db..34f5736ba00 100644
> >>>>>> --- a/gcc/tree-vect-stmts.cc
> >>>>>> +++ b/gcc/tree-vect-stmts.cc
> >>>>>> @@ -11558,7 +11558,8 @@ vectorizable_load (vec_info *vinfo,
> >>>>>>                                     - (vec_num * j + i) * nunits);
> >>>>>>                                /* remain should now be > 0 and < 
> >>>>>> nunits.  */
> >>>>>>                                unsigned num;
> >>>>>> -                           if (constant_multiple_p (nunits, remain, 
> >>>>>> &num))
> >>>>>> +                           if (known_gt (remain, 0)
> >>>>>> +                               && constant_multiple_p (nunits, 
> >>>>>> remain, &num))
> >>>>>>                                  {
> >>>>>>                                    tree ptype;
> >>>>>>                                    new_vtype
> >
> >
> >>>>>> --
> >>>>>> 2.43.2
> >>>>>>
> >>>> Edwin
> >>>>

Reply via email to