Richard Biener <rguent...@suse.de> writes:
> On Wed, 12 Jun 2024, Richard Biener wrote:
>
>> On Tue, 11 Jun 2024, Richard Sandiford wrote:
>>
>> > Don't think it makes any difference, but:
>> >
>> > Richard Biener <rguent...@suse.de> writes:
>> > > @@ -2151,7 +2151,16 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>> > >             access excess elements.
>> > >             ??? Enhancements include peeling multiple iterations
>> > >             or using masked loads with a static mask.  */
>> > > -        || (group_size * cvf) % cnunits + group_size - gap < cnunits))
>> > > +        || ((group_size * cvf) % cnunits + group_size - gap < cnunits
>> > > +            /* But peeling a single scalar iteration is enough if
>> > > +               we can use the next power-of-two sized partial
>> > > +               access.  */
>> > > +            && ((cremain = (group_size * cvf - gap) % cnunits), true
>> >
>> > ...this might be less surprising as:
>> >
>> >   && ((cremain = (group_size * cvf - gap) % cnunits, true)
>> >
>> > in terms of how the &&s line up.
>>
>> Yeah - I'll fix before pushing.
>
> The aarch64 CI shows that a few testcases no longer use SVE
> (gcc.target/aarch64/sve/slp_perm_{4,7,8}.c) because peeling
> for gaps is deemed insufficient.  Formerly we had
>
>       if (loop_vinfo
>           && *memory_access_type == VMAT_CONTIGUOUS
>           && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
>           && !multiple_p (group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo),
>                           nunits))
>         {
>           unsigned HOST_WIDE_INT cnunits, cvf;
>           if (!can_overrun_p
>               || !nunits.is_constant (&cnunits)
>               || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&cvf)
>               /* Peeling for gaps assumes that a single scalar iteration
>                  is enough to make sure the last vector iteration doesn't
>                  access excess elements.
>                  ??? Enhancements include peeling multiple iterations
>                  or using masked loads with a static mask.  */
>               || (group_size * cvf) % cnunits + group_size - gap < cnunits)
>             {
>               if (dump_enabled_p ())
>                 dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                                  "peeling for gaps insufficient for "
>                                  "access\n");
>
> and in all cases multiple_p (group_size * LOOP_VINFO_VECT_FACTOR, nunits)
> is true, so we didn't check whether peeling one iteration is
> sufficient.  But after the refactoring the outer checks merely
> indicate there's overrun (which is there already because gap != 0).
>
> That is, we never verified, for the "regular" gap case, whether peeling
> a single iteration is sufficient.  But now of course we run into
> the inner check, which will always trigger if the earlier checks didn't
> work out to set overrun_p to false.
>
> For slp_perm_8.c we have a group_size of two, nunits is {16, 16}
> and VF is {8, 8} and gap is one.  Given we know the multiple_p,
> we know that (group_size * cvf) % cnunits is zero, so what remains
> is group_size - gap < nunits, but 1 is probably always less than
> {16, 16}.

I thought the idea was that the size of the gap was immaterial for
VMAT_CONTIGUOUS, on the assumption that it would never be bigger than
a page.  That is, any gap loaded by the final unpeeled iteration would
belong to the same page as the non-gap data from either the same
vector iteration or the subsequent peeled scalar iteration.

I'll have to think more about this, but it doesn't affect the rest
of the message, so FWIW...

> The new logic I added in the later patch, that peeling a single
> iteration is OK when we use a smaller, rounded-up-to-power-of-two
> sized access, is
>
>             || ((group_size * cvf) % cnunits + group_size - gap < cnunits
>                 /* But peeling a single scalar iteration is enough if
>                    we can use the next power-of-two sized partial
>                    access.  */
>                 && (cremain = (group_size * cvf - gap) % cnunits, true)
>                 && (cpart_size = (1 << ceil_log2 (cremain))) != cnunits
>                 && vector_vector_composition_type
>                      (vectype, cnunits / cpart_size,
>                       &half_vtype) == NULL_TREE)))
>
> again knowing the multiple we know cremain is nunits - gap, and with
> gap == 1 rounding this size up will yield nunits, and thus the
> existing peeling is OK.  Something is inconsistent here and the
> pre-existing
>
>   (group_size * cvf) % cnunits + group_size - gap < cnunits
>
> check looks suspicious as a general check.
>
>   (group_size * cvf - gap)
>
> should be the number of elements we can access without touching
> excess elements.  Peeling a single iteration will make sure
> group_size * cvf + group_size - gap is accessed
> (that's group_size * (cvf + 1) - gap).  The excess elements
> touched in the vector loop are
>
>   cnunits - (group_size * cvf - gap) % cnunits
>
> I think that number needs to be less than or equal to group_size, so
> the correct check should be
>
>   (group_size * cvf - gap) % cnunits + group_size < cnunits
>
> for the SVE case that's (nunits - 1) + 2 < nunits, which should
> simplify to false.
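FWIW, plugging the slp_perm_8.c numbers into a throwaway standalone
program agrees with that.  (Plain unsigned ints stand in for the
poly_int values here, so this only models the VL==128 case, and the
open-coded loop stands in for ceil_log2.)

#include <stdio.h>

int
main (void)
{
  /* slp_perm_8.c with 128-bit SVE vectors.  */
  unsigned int group_size = 2, gap = 1, cnunits = 16, cvf = 8;

  /* Pre-existing check: nonzero means peeling a single scalar
     iteration is deemed insufficient.  */
  int old_check = (group_size * cvf) % cnunits + group_size - gap < cnunits;

  /* Corrected check from above.  */
  int new_check = (group_size * cvf - gap) % cnunits + group_size < cnunits;

  /* Round the remainder up to the next power of two: with gap == 1
     it rounds back up to a full vector, so no split access is needed
     on top of the peeling.  */
  unsigned int cremain = (group_size * cvf - gap) % cnunits;
  unsigned int cpart_size = 1;
  while (cpart_size < cremain)
    cpart_size <<= 1;

  /* Prints: old 1 new 0 cremain 15 cpart_size 16.  */
  printf ("old %d new %d cremain %u cpart_size %u\n",
          old_check, new_check, cremain, cpart_size);
  return 0;
}

i.e. the old form wrongly bails out for this case while the corrected
one accepts the single-iteration peel.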
> Now the question is how to formulate this with poly-ints in a way
> that it works out, for the case in question doing the overrun check
> as
>
>   if (overrun_p
>       && can_div_trunc_p (group_size
>                           * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
>                           nunits, &tem, &remain)
>       && maybe_lt (remain + group_size, nunits))

...it looks like this should be known_lt, if we're relying on it for
correctness.
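To spell out the difference: modelling a two-coefficient poly_int as
c0 + c1 * x for an unknown runtime x >= 0, the two predicates reduce
to roughly the following (a sketch for illustration only, not the
real poly-int.h implementation):

#include <stdbool.h>
#include <stdio.h>

/* Does a0 + a1 * x < b0 + b1 * x hold for *some* x >= 0?
   Yes iff it holds at x == 0 or for sufficiently large x.  */
static bool
toy_maybe_lt (int a0, int a1, int b0, int b1)
{
  return a0 < b0 || a1 < b1;
}

/* Does it hold for *all* x >= 0?  */
static bool
toy_known_lt (int a0, int a1, int b0, int b1)
{
  return a0 < b0 && a1 <= b1;
}

int
main (void)
{
  /* remain + group_size = {17, 16} vs. nunits = {16, 16}: neither
     predicate holds, so for this case the check behaves the same
     either way.  */
  printf ("%d %d\n", toy_maybe_lt (17, 16, 16, 16),
          toy_known_lt (17, 16, 16, 16));      /* 0 0 */

  /* They part ways when the comparison depends on the runtime VL,
     e.g. {15, 17} vs. {16, 16}: less at x == 0 but not at x >= 1.  */
  printf ("%d %d\n", toy_maybe_lt (15, 17, 16, 16),
          toy_known_lt (15, 17, 16, 16));      /* 1 0 */
  return 0;
}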
Thanks,
Richard

> seems to do the trick.  I'm going to test this, will post an updated
> series for the CI.
>
> Richard.
>
>> Thanks,
>> Richard.
>>
>> > Thanks,
>> > Richard
>> >
>> > > +            && ((cpart_size = (1 << ceil_log2 (cremain)))
>> > > +                != cnunits)
>> > > +            && vector_vector_composition_type
>> > > +                 (vectype, cnunits / cpart_size,
>> > > +                  &half_vtype) == NULL_TREE))))
>> > >        {
>> > >          if (dump_enabled_p ())
>> > >            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> > > @@ -11599,6 +11608,27 @@ vectorizable_load (vec_info *vinfo,
>> > >                  gcc_assert (new_vtype
>> > >                              || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
>> > > +                /* But still reduce the access size to the next
>> > > +                   required power-of-two so peeling a single
>> > > +                   scalar iteration is sufficient.  */
>> > > +                unsigned HOST_WIDE_INT cremain;
>> > > +                if (remain.is_constant (&cremain))
>> > > +                  {
>> > > +                    unsigned HOST_WIDE_INT cpart_size
>> > > +                      = 1 << ceil_log2 (cremain);
>> > > +                    if (known_gt (nunits, cpart_size)
>> > > +                        && constant_multiple_p (nunits, cpart_size,
>> > > +                                                &num))
>> > > +                      {
>> > > +                        tree ptype;
>> > > +                        new_vtype
>> > > +                          = vector_vector_composition_type (vectype,
>> > > +                                                            num,
>> > > +                                                            &ptype);
>> > > +                        if (new_vtype)
>> > > +                          ltype = ptype;
>> > > +                      }
>> > > +                  }
>> > >              }
>> > >          }
>> > >        tree offset
>> >
>>