On Tue, 12 Dec 2023, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > The following avoids an over-/under-read of storage when vectorizing
> > a non-grouped load with SLP.  Instead of forcing peeling for gaps,
> > use a smaller load for the last vector, which might otherwise access
> > excess elements.  This builds upon the existing optimization avoiding
> > peeling for gaps, generalizing it to all gap widths that leave a
> > power-of-two number of remaining elements (but it doesn't replace
> > or improve that particular case at this point).
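> >
> > As a concrete (made-up) example: with V4SI, group_size == 1, vf == 4
> > and gap == 2 the scalar loop accesses group_size * vf - gap == 2
> > elements per vector iteration while a full V4SI load would access 4.
> > The patch computes remain == 2 for the last load; since nunits is a
> > multiple of remain, the load can be done in a V2SI-sized piece and
> > the rest of the vector padded out with zeros.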
> >
> > I wonder if the poly relational compares I set up are good enough
> > to guarantee /* remain should now be > 0 and < nunits.  */.
> >
> > There is existing test coverage that always runs into
> > /* DR will be unused.  */ when the gap is wider than nunits.  Compared
> > to the existing gap == nunits/2 case this only adjusts the load that
> > will cause the overrun at the end, not every load.  Apart from the
> > poly relational compares it should reliably cover these cases, but
> > I'll leave removing it for stage1.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> > built and tested SPEC CPU 2017.
> >
> > OK?
> >
> >     PR tree-optimization/112736
> >     * tree-vect-stmts.cc (vectorizable_load): Extend optimization
> >     to avoid peeling for gaps to handle single-element non-groups
> >     we now allow with SLP.
> >
> >     * gcc.dg/torture/pr112736.c: New testcase.
> 
> Mostly LGTM FWIW.  A couple of comments below:
> 
> > ---
> >  gcc/testsuite/gcc.dg/torture/pr112736.c | 27 ++++++++
> >  gcc/tree-vect-stmts.cc                  | 86 ++++++++++++++++++++-----
> >  2 files changed, 96 insertions(+), 17 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/torture/pr112736.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/torture/pr112736.c b/gcc/testsuite/gcc.dg/torture/pr112736.c
> > new file mode 100644
> > index 00000000000..6abb56edba3
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/torture/pr112736.c
> > @@ -0,0 +1,27 @@
> > +/* { dg-do run { target *-*-linux* *-*-gnu* *-*-uclinux* } } */
> > +
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +
> > +int a, c[3][5];
> > +
> > +void __attribute__((noipa))
> > +fn1 (int * __restrict b)
> > +{
> > +  int e;
> > +  for (a = 2; a >= 0; a--)
> > +    for (e = 0; e < 4; e++)
> > +      c[a][e] = b[a];
> > +}
> > +
> > +int main()
> > +{
> > +  long pgsz = sysconf (_SC_PAGESIZE);
> > +  void *p = mmap (NULL, pgsz * 2, PROT_READ|PROT_WRITE,
> > +                  MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
> > +  if (p == MAP_FAILED)
> > +    return 0;
> > +  mprotect (p, pgsz, PROT_NONE);
> > +  fn1 (p + pgsz);
> > +  return 0;
> > +}
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 390c8472fd6..c03c4c08c9d 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -11465,26 +11465,70 @@ vectorizable_load (vec_info *vinfo,
> >                     if (new_vtype != NULL_TREE)
> >                       ltype = half_vtype;
> >                   }
> > +               /* Try to use a single smaller load when we are about
> > +                  to load accesses excess elements compared to the
> 
> s/accesses //
> 
> > +                  unrolled scalar loop.
> > +                  ???  This should cover the above case as well.  */
> > +               else if (known_gt ((vec_num * j + i + 1) * nunits,
> > +                                  (group_size * vf - gap)))
> 
> At first it seemed odd to be using known_gt rather than maybe_gt here,
> given that peeling for gaps is a correctness issue.  But as things stand
> this is just an optimisation, and VLA vectors (to whatever extent they're
> handled by this code) allegedly work correctly without it.  So I agree
> known_gt is correct.  We might need to revisit it when dealing with the
> ??? though.

Yes, maybe_gt would be needed if this were for correctness, but then ...

> > +                 {
> > +                   if (known_ge ((vec_num * j + i + 1) * nunits
> > +                                 - (group_size * vf - gap), nunits))
> > +                     /* DR will be unused.  */
> > +                     ltype = NULL_TREE;
> > +                   else if (alignment_support_scheme == dr_aligned)
> > +                     /* Aligned access to excess elements is OK if
> > +                        at least one element is accessed in the
> > +                        scalar loop.  */
> > +                     ;
> > +                   else
> > +                     {
> > +                       auto remain
> > +                         = ((group_size * vf - gap)
> > +                            - (vec_num * j + i) * nunits);
> > +                       /* remain should now be > 0 and < nunits.  */

... we probably don't know it's < nunits anymore.  Indeed, let's revisit
this at some later point.
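
For the record, a quick sketch of the poly_int semantics in play here,
assuming a target with NUM_POLY_INT_COEFFS == 2 like aarch64, where a
poly value a + b*x ranges over all runtime x >= 0:

  poly_uint64 nunits (4, 4);   /* 4 + 4x, as for VNx4SI.  */
  known_gt (nunits, 4);        /* false: they are equal when x == 0.  */
  maybe_gt (nunits, 4);        /* true: greater for any x > 0.  */
  known_gt (nunits, 3);        /* true: 4 + 4x > 3 for all x.  */

So known_gt only triggers when the excess access is certain, which is
enough for an optimization; maybe_gt would be the conservative choice
if the transform were needed for correctness.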

> > +                       unsigned num;
> > +                       if (constant_multiple_p (nunits, remain, &num))
> > +                         {
> > +                           tree ptype;
> > +                           new_vtype
> > +                             = vector_vector_composition_type (vectype,
> > +                                                               num,
> > +                                                               &ptype);
> > +                           if (new_vtype)
> > +                             ltype = ptype;
> > +                         }
> > +                       /* Else use multiple loads or a masked load?  */
> > +                     }
> > +                 }
> >                 tree offset
> >                   = (dataref_offset ? dataref_offset
> >                                     : build_int_cst (ref_type, 0));
> > -               if (ltype != vectype
> > -                   && memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> > +               if (!ltype)
> > +                 ;
> > +               else if (ltype != vectype
> > +                        && memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> >                   {
> >                     unsigned HOST_WIDE_INT gap_offset
> > -                     = gap * tree_to_uhwi (TYPE_SIZE_UNIT (elem_type));
> > +                     = (tree_to_uhwi (TYPE_SIZE_UNIT (vectype))
> > +                        - tree_to_uhwi (TYPE_SIZE_UNIT (ltype)));
> >                     tree gapcst = build_int_cst (ref_type, gap_offset);
> >                     offset = size_binop (PLUS_EXPR, offset, gapcst);
> 
> Can this trigger for the VLA case?  Probably not.  But it might be worth
> using tree_to_poly_uint64 instead of tree_to_uhwi, and auto instead of
> unsigned HOST_WIDE_INT, since the calculation is no longer based on scalars.

I'm not 100% sure it cannot, so fixed.
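
I.e. something along these lines (just a sketch, assuming the
poly_uint64 overload of build_int_cstu; the final version may differ):

  else if (ltype != vectype
           && memory_access_type == VMAT_CONTIGUOUS_REVERSE)
    {
      /* Offset to the first element the smaller load leaves out.  */
      auto gap_offset
        = (tree_to_poly_uint64 (TYPE_SIZE_UNIT (vectype))
           - tree_to_poly_uint64 (TYPE_SIZE_UNIT (ltype)));
      tree gapcst = build_int_cstu (ref_type, gap_offset);
      offset = size_binop (PLUS_EXPR, offset, gapcst);
    }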

> >                   }
> > -               data_ref
> > -                 = fold_build2 (MEM_REF, ltype, dataref_ptr, offset);
> > -               if (alignment_support_scheme == dr_aligned)
> > -                 ;
> > -               else
> > -                 TREE_TYPE (data_ref)
> > -                   = build_aligned_type (TREE_TYPE (data_ref),
> > -                                         align * BITS_PER_UNIT);
> > -               if (ltype != vectype)
> > +               if (ltype)
> > +                 {
> > +                   data_ref
> > +                     = fold_build2 (MEM_REF, ltype, dataref_ptr, offset);
> > +                   if (alignment_support_scheme == dr_aligned)
> > +                     ;
> > +                   else
> > +                     TREE_TYPE (data_ref)
> > +                       = build_aligned_type (TREE_TYPE (data_ref),
> > +                                             align * BITS_PER_UNIT);
> > +                 }
> > +               if (!ltype)
> > +                 data_ref = build_constructor (vectype, NULL);
> > +               else if (ltype != vectype)
> >                   {
> >                     vect_copy_ref_info (data_ref,
> >                                         DR_REF (first_dr_info->dr));
> > @@ -11494,18 +11538,26 @@ vectorizable_load (vec_info *vinfo,
> >                                                  gsi);
> >                     data_ref = NULL;
> >                     vec<constructor_elt, va_gc> *v;
> > -                   vec_alloc (v, 2);
> > +                   unsigned num
> > +                     = (exact_div (tree_to_poly_uint64
> > +                                     (TYPE_SIZE_UNIT (vectype)),
> > +                                   tree_to_poly_uint64
> > +                                     (TYPE_SIZE_UNIT (ltype)))
> > +                        .to_constant ());
> 
> FWIW, vector_unroll_factor probably fits here, but maybe that's a bit
> of a stretch.  The current version is OK with me too.

Yeah, we've computed this number above, either statically (two, for the
"old" correctness case) or via constant_multiple_p.  It was just least
awkward to recompute it here ...  I'll add a comment.
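
Concretely, something like this (the hunk from the patch with the
promised comment added, otherwise unchanged):

  /* ???  This re-computes what is known statically (two, for the
     pre-existing nunits / 2 case) or via constant_multiple_p for the
     new case; passing the number down was more awkward than
     recomputing it here.  */
  unsigned num
    = (exact_div (tree_to_poly_uint64 (TYPE_SIZE_UNIT (vectype)),
                  tree_to_poly_uint64 (TYPE_SIZE_UNIT (ltype)))
       .to_constant ());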

Richard.

> Thanks,
> Richard
> 
> > +                   vec_alloc (v, num);
> >                     if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> >                       {
> > -                       CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> > -                                               build_zero_cst (ltype));
> > +                       while (--num)
> > +                         CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> > +                                                 build_zero_cst (ltype));
> >                         CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
> >                       }
> >                     else
> >                       {
> >                         CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
> > -                       CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> > -                                               build_zero_cst (ltype));
> > +                       while (--num)
> > +                         CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> > +                                                 build_zero_cst (ltype));
> >                       }
> >                     gcc_assert (new_vtype != NULL_TREE);
> >                     if (new_vtype == vectype)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
