On Wed, 12 Jun 2024, Richard Biener wrote:

> On Tue, 11 Jun 2024, Richard Sandiford wrote:
> 
> > Don't think it makes any difference, but:
> > 
> > Richard Biener <rguent...@suse.de> writes:
> > > @@ -2151,7 +2151,16 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> > >                access excess elements.
> > >                ???  Enhancements include peeling multiple iterations
> > >                or using masked loads with a static mask.  */
> > > -           || (group_size * cvf) % cnunits + group_size - gap < cnunits))
> > > +           || ((group_size * cvf) % cnunits + group_size - gap < cnunits
> > > +               /* But peeling a single scalar iteration is enough if
> > > +                  we can use the next power-of-two sized partial
> > > +                  access.  */
> > > +               && ((cremain = (group_size * cvf - gap) % cnunits), true
> > 
> > ...this might be less surprising as:
> > 
> >                   && ((cremain = (group_size * cvf - gap) % cnunits, true)
> > 
> > in terms of how the &&s line up.
> 
> Yeah - I'll fix before pushing.

The aarch64 CI shows that a few testcases no longer use SVE
(gcc.target/aarch64/sve/slp_perm_{4,7,8}.c) because peeling
for gaps is deemed insufficient.  Formerly we had

          if (loop_vinfo
              && *memory_access_type == VMAT_CONTIGUOUS
              && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
              && !multiple_p (group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo),
                              nunits))
            {
              unsigned HOST_WIDE_INT cnunits, cvf;
              if (!can_overrun_p
                  || !nunits.is_constant (&cnunits)
                  || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&cvf)
                  /* Peeling for gaps assumes that a single scalar iteration
                     is enough to make sure the last vector iteration doesn't
                     access excess elements.
                     ???  Enhancements include peeling multiple iterations
                     or using masked loads with a static mask.  */
                  || (group_size * cvf) % cnunits + group_size - gap < cnunits)
                {
                  if (dump_enabled_p ())
                    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                     "peeling for gaps insufficient for "
                                     "access\n");

and in all cases multiple_p (group_size * LOOP_VINFO_VECT_FACTOR, nunits)
is true, so we never checked whether peeling one iteration is
sufficient.  But after the refactoring the outer check merely
indicates there's an overrun (which is there already because gap != 0).

That is, we never verified, for the "regular" gap case, whether peeling
for a single iteration is sufficient.  But now of course we run into
the inner check which will always trigger if earlier checks didn't
work out to set overrun_p to false.

For slp_perm_8.c we have a group_size of two, nunits is {16, 16},
VF is {8, 8} and the gap is one.  Given we know multiple_p holds,
(group_size * cvf) % cnunits is zero, so what remains is
group_size - gap < nunits, and 1 is always less than {16, 16}.

The new logic I added in the later patch, allowing peeling of a
single iteration when we use a smaller access rounded up to the next
power-of-two size, is

                  || ((group_size * cvf) % cnunits + group_size - gap < cnunits
                      /* But peeling a single scalar iteration is enough if
                         we can use the next power-of-two sized partial
                         access.  */
                      && (cremain = (group_size * cvf - gap) % cnunits, true)
                      && (cpart_size = (1 << ceil_log2 (cremain))) != cnunits
                      && vector_vector_composition_type
                           (vectype, cnunits / cpart_size,
                            &half_vtype) == NULL_TREE)))

Again, knowing the multiple, we know cremain is nunits - gap, and
with gap == 1 rounding this size up will yield nunits, so the
existing peeling is deemed OK.  Something is inconsistent here, and
the pre-existing

  (group_size * cvf) % cnunits + group_size - gap < cnunits

check looks suspicious as a general check.
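
Concretely, with the slp_perm_8.c constants (group_size = 2, cvf = 8,
cnunits = 16, gap = 1) the pre-existing check gives 0 + 2 - 1 = 1 < 16
and flags peeling as insufficient, while the new power-of-two path
computes cremain = 15 and cpart_size = 16 == cnunits, so it does not
flag anything.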

  (group_size * cvf - gap)

should be the number of elements we can access without touching
excess elements.  Peeling a single scalar iteration makes sure that
group_size * cvf + group_size - gap elements are accessed
(that's group_size * (cvf + 1) - gap).  The number of excess elements
touched in the vector loop is

  cnunits - (group_size * cvf - gap) % cnunits

I think that number needs to be less than or equal to group_size, so
the correct check (for a single peeled iteration being insufficient)
should be

  (group_size * cvf - gap) % cnunits + group_size < cnunits

for the SVE case that's (nunits - 1) + 2 < nunits, which should
simplify to false.  The question now is how to formulate this with
poly-ints in a way that works out; for the case in question, doing
the overrun check as

          if (overrun_p
              && can_div_trunc_p (group_size
                                  * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
                                  nunits, &tem, &remain)
              && maybe_lt (remain + group_size, nunits))

seems to do the trick.  I'm going to test this and will post an
updated series for the CI.
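
For reference, here is a standalone sketch (plain C++ with scalar
constants rather than GCC's poly-int types; 16 stands in for the
{16, 16} nunits) that compares the two predicates for the
slp_perm_8.c numbers:

  /* Hypothetical standalone check, not GCC code: evaluate the old and
     the corrected "peeling for gaps insufficient" predicate for
     group_size = 2, VF = 8, nunits = 16, gap = 1.  */
  #include <cstdio>

  int main ()
  {
    const unsigned group_size = 2, cvf = 8, cnunits = 16, gap = 1;

    /* Pre-existing form: 0 + 2 - 1 = 1 < 16, so a single peeled scalar
       iteration is flagged as insufficient.  */
    bool old_insufficient
      = (group_size * cvf) % cnunits + group_size - gap < cnunits;

    /* Corrected form: 15 + 2 = 17 < 16 is false, so a single peeled
       iteration is considered enough.  */
    unsigned remain = (group_size * cvf - gap) % cnunits;
    bool new_insufficient = remain + group_size < cnunits;

    printf ("old: %d, corrected: %d\n", old_insufficient, new_insufficient);
    return 0;
  }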

Richard.

> Thanks,
> Richard.
> 
> > Thanks,
> > Richard
> > 
> > > +                   && ((cpart_size = (1 << ceil_log2 (cremain)))
> > > +                       != cnunits)
> > > +                   && vector_vector_composition_type
> > > +                        (vectype, cnunits / cpart_size,
> > > +                         &half_vtype) == NULL_TREE))))
> > >       {
> > >         if (dump_enabled_p ())
> > >           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > @@ -11599,6 +11608,27 @@ vectorizable_load (vec_info *vinfo,
> > >                         gcc_assert (new_vtype
> > >                                     || LOOP_VINFO_PEELING_FOR_GAPS
> > >                                          (loop_vinfo));
> > > +                     /* But still reduce the access size to the next
> > > +                        required power-of-two so peeling a single
> > > +                        scalar iteration is sufficient.  */
> > > +                     unsigned HOST_WIDE_INT cremain;
> > > +                     if (remain.is_constant (&cremain))
> > > +                       {
> > > +                         unsigned HOST_WIDE_INT cpart_size
> > > +                           = 1 << ceil_log2 (cremain);
> > > +                         if (known_gt (nunits, cpart_size)
> > > +                             && constant_multiple_p (nunits, cpart_size,
> > > +                                                     &num))
> > > +                           {
> > > +                             tree ptype;
> > > +                             new_vtype
> > > +                               = vector_vector_composition_type (vectype,
> > > +                                                                 num,
> > > +                                                                 &ptype);
> > > +                             if (new_vtype)
> > > +                               ltype = ptype;
> > > +                           }
> > > +                       }
> > >                     }
> > >                 }
> > >               tree offset
> > 
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
