On 7/24/2024 3:52 AM, Richard Biener wrote:
On Wed, Jul 24, 2024 at 1:31 AM Edwin Lu <e...@rivosinc.com> wrote:

On 7/23/2024 11:20 AM, Richard Sandiford wrote:
Edwin Lu <e...@rivosinc.com> writes:
On 7/23/2024 4:56 AM, Richard Biener wrote:
On Tue, Jul 23, 2024 at 1:03 AM Edwin Lu <e...@rivosinc.com> wrote:
Hi Richard,

On 5/31/2024 1:48 AM, Richard Biener wrote:
On Thu, May 30, 2024 at 2:11 AM Patrick O'Neill <patr...@rivosinc.com> wrote:
From: Greg McGary <g...@rivosinc.com>
Still a NACK.  If remain ends up zero then

                        /* Try to use a single smaller load when we are about
                           to load excess elements compared to the unrolled
                           scalar loop.  */
                        if (known_gt ((vec_num * j + i + 1) * nunits,
                                           (group_size * vf - gap)))
                          {
                            poly_uint64 remain = ((group_size * vf - gap)
                                                  - (vec_num * j + i) * nunits);
                            if (known_ge ((vec_num * j + i + 1) * nunits
                                          - (group_size * vf - gap), nunits))
                              /* DR will be unused.  */
                              ltype = NULL_TREE;

needs to be re-formulated so that the combined conditions make sure
this doesn't happen.  The outer known_gt should already ensure that
remain > 0.  For correctness that should possibly be maybe_gt though.
Yeah.  FWIW, I mentioned the maybe_gt thing in
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/653013.html:

    Pre-existing, but shouldn't this be maybe_gt rather than known_gt?
    We can only skip doing it if we know for sure that the load won't cross
    the gap.  (Not sure whether the difference can trigger in practice.)

But AFAICT, the known_gt doesn't inherently prove that remain is known
to be nonzero.  It just proves that the gap between the end of the scalar
accesses and the end of this vector is known to be nonzero.

Putting the list back in the loop and CCing Richard S.

I'm currently looking into this patch and am trying to figure out what
is going on. Stepping through gdb, I see that remain == {coeffs = {0,
2}} and nunits == {coeffs = {2, 2}} (the outer known_gt returned true
with known_gt({coeffs = {8, 8}}, {coeffs = {6, 8}})).

    From what I understand, this falls under the umbrella of 0 <= remain <
nunits. The divide by zero error is because of the 0 <= remain which is
coming from the constant_multiple_p function in poly-int.h where it
performs the modulus NCa(a.coeffs[0]) % NCb(b.coeffs[0]).
(https://github.com/gcc-mirror/gcc/blob/master/gcc/poly-int.h#L1970-L1971)


    >                          if (known_ge ((vec_num * j + i + 1) * nunits
    >                                        - (group_size * vf - gap),
nunits))
    >                            /* DR will be unused.  */
    >                            ltype = NULL_TREE;

This if condition is a bit suspicious to me though. I'm seeing that it's
evaluating known_ge({coeffs = {2, 0}}, {coeffs = {2, 2}}) which is
returning false. Should it be maybe_ge instead?
No, we can only not emit a load if we know it won't be used, not if
it eventually cannot be used.
Agreed.

[switching round for easier reply]
After running some
tests, to me it looks like it doesn't vectorize quite as often; however,
I'm not fully sure what else to do when the coeffs can potentially be
equal to 0.

Should it even be possible for there to be a {coeffs = {0, n}}
situation? My understanding of how poly_ints are used for representing
vectorization is that the first coefficient is the number of elements
needed to make the minimum supported vector size. That is, if vector
lengths are 128 bits, element size is 32 bits, coeff[0] should be
minimum of 4. Is this understanding correct?
I was told n can be negative, but nunits.coeff[0] should be non-zero.
What would it mean for the coeffs[0] to be 0? Would that mean the vector length 
supports 0 bits?
coeffs = {A,B} just means A+B*X, where X is the number of vector
"chunks" beyond the minimum length.  It's certainly valid for a poly_int
to have a zero coeffs[0] (i.e. zero A).  For example, (the length of a
vector) - (the minimum length) would have this property.
Thanks for the explanation! I have a few clarification questions about this.
If I understand correctly, B would represent the number of elements the vector 
can have (for 128b vector operating on 32b elements, B == 4, but if operating 
on 64b elements B == 2); however, I'm not too sure what A represents.

On the poly_int docs, it says
An indeterminate value of 0 should usually represent the minimum possible 
runtime value, with c0 specifying the value in that case.
"minimum possible runtime value" doesn't make sense to me. Does it mean the 
potential minimum bound of elements it will operate on?

What is j and i when the divisor is zero?
The values I see in gdb are: vec_num = 4 j = 0 i = 3 vf = {coeffs = {2,
2}} nunits = {coeffs = {2, 2}} group_size = 4 gap = 2 vect_align = 2
remain = {coeffs = {0, 2}}
OK, so let's use D to mean "data" and G to mean "gap".  Then, for the
minimum vector length of 2 elements, we have:

    DD GG DD GG

The last load will read beyond the scalar loop if the vector loop happens
to handle all elements of the scalar loop.

For a vector length of 4 elements, we have:

    DDGG DDGG DDGG DDGG

where every load contains both data and gaps.  The same will be true
for larger vectors.

That's where remain={0,2} is coming from.  The last load is fully redundant
for the minimum VL but not for larger VL.
I think I understand the example but just want to see if I fully understand.
If we have 7 data elements with minimum vector length of 2 elements, the vector 
loop would essentially run 3 times on

    DD GG DD GG DD GG

There is an extra Data element which hasn't been processed. Since there is one 
element left over and is less than the minimum vector length, we can either
1. run scalar execution of the number of remaining elements (finishing scalar 
loop) or
2. see if a smaller vector size can process it nicely (i.e. minimum vector 
length of 1 element in this case. remain now is {1, 1}) or
3. use the same vector size but mask out the remaining element

Is this a correct?

Based on that, the patch below looks correct to me, but I might have
misunderstood the intent.
As an alternative to the original patch, would this also make sense?

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index aab3aa59962..cd657ac63af 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -11479,7 +11479,7 @@ vectorizable_load (vec_info *vinfo,
                      /* Try to use a single smaller load when we are about
                         to load excess elements compared to the unrolled
                         scalar loop.  */
-                   if (known_gt ((vec_num * j + i + 1) * nunits,
+                   if (maybe_gt ((vec_num * j + i + 1) * nunits,
                                         (group_size * vf - gap)))
                        {
                          poly_uint64 remain = ((group_size * vf - gap)
That's a good point - this should indeed be maybe_gt

@@ -11502,6 +11502,10 @@ vectorizable_load (vec_info *vinfo,
                            /* Aligned access to the gap area when there's
                               at least one element in it is OK.  */
                            ;
+                       else if (maybe_eq (remain, 0))
+                         /* Handle remain.coeffs[0] == 0 case. Number of
+                            elements is an exact multiple of vector length.  */
+                         ;
Hmm, but here it should be known_eq?  And why doesn't
constant_multiple_p work then?  I think remain.coeffs[1] is never
negative here, but that would be another way to arrive at zero.

If I use known_eq, the condition would be skipped and we'll enter the
final else statement which triggers the floating point exception in
constant_multiple_p.

I don't know what it would mean for remain.coeffs[1] to be negative.
To me it doesn't really make sense given my understanding that
it represents the number of additional chunks beyond the minimum vector
size. I think it would make sense to assert remain.coeffs[1] >= 0?

I believe it would need to be maybe_eq since if remain.coeffs[1] is
non-negative, the only way for remain == 0 is if X == 0 and remain.coeffs[0] == 
0.

So in principle this would amend

                         if (known_ge ((vec_num * j + i + 1) * nunits
                                       - (group_size * vf - gap), nunits))
                           /* DR will be unused.  */
                           ltype = NULL_TREE;

?

If we know remain == 0, then yes, I think it does amend that statement.
However, since we don't know what X is, we aren't guaranteed that the load
will eventually not be used.

Sure the original patch would work, but then the corresponding change
in get_group_loadstore_type needs to be done to ensure if remain
maybe zero we're forcing peeling for gaps, no?

If remain may be zero, then wouldn't all the data already be loaded into
vectors during the vector loop? Would we still need to peel for gaps
in the scalar loop if all the data is processed?

I'm not sure if my understanding is correct so please forgive me if
this doesn't make sense.

                          else
                            {
                              /* remain should now be > 0 and < nunits.  */

Thanks,
Richard

Edwin

gcc/ChangeLog:
            * gcc/tree-vect-stmts.cc (gcc/tree-vect-stmts.cc): Prevent 
divide-by-zero.
            * testsuite/gcc.target/riscv/rvv/autovec/no-segment.c: Remove 
dg-ice.
---
No changes in v3. Depends on the risc-v backend option added in patch 1 to
trigger the ICE.
---
     gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c | 1 -
     gcc/tree-vect-stmts.cc                                  | 3 ++-
     2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c
index dfbe09f01a1..79d03612a22 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/no-segment.c
@@ -1,6 +1,5 @@
     /* { dg-do compile } */
     /* { dg-options "-march=rv64gcv -mabi=lp64d -mrvv-vector-bits=scalable -O3 
-mno-autovec-segment" } */
-/* { dg-ice "Floating point exception" } */

     enum e { c, d };
     enum g { f };
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 4219ad832db..34f5736ba00 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -11558,7 +11558,8 @@ vectorizable_load (vec_info *vinfo,
                                     - (vec_num * j + i) * nunits);
                                /* remain should now be > 0 and < nunits.  */
                                unsigned num;
-                           if (constant_multiple_p (nunits, remain, &num))
+                           if (known_gt (remain, 0)
+                               && constant_multiple_p (nunits, remain, &num))
                                  {
                                    tree ptype;
                                    new_vtype

Reply via email to