On Tue, 12 Dec 2023, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > The following avoids over/under-read of storage when vectorizing
> > a non-grouped load with SLP.  Instead of forcing peeling for gaps
> > use a smaller load for the last vector which might access excess
> > elements.  This builds upon the existing optimization avoiding
> > peeling for gaps, generalizing it to all gap widths leaving a
> > power-of-two remaining number of elements (but it doesn't replace
> > or improve that particular case at this point).
> >
> > I wonder if the poly relational compares I set up are good enough
> > to guarantee /* remain should now be > 0 and < nunits.  */.
> >
> > There is existing test coverage that runs into /* DR will be unused.  */
> > always when the gap is wider than nunits.  Compared to the
> > existing gap == nunits/2 case this only adjusts the load that will
> > cause the overrun at the end, not every load.  Apart from the
> > poly relational compares it should reliably cover these cases but
> > I'll leave it for stage1 to remove.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> > built and tested SPEC CPU 2017.
> >
> > OK?
> >
> >         PR tree-optimization/112736
> >         * tree-vect-stmts.cc (vectorizable_load): Extend optimization
> >         to avoid peeling for gaps to handle single-element non-groups
> >         we now allow with SLP.
> >
> >         * gcc.dg/torture/pr112736.c: New testcase.
>
> Mostly LGTM FWIW.  A couple of comments below:
>
> > ---
> >  gcc/testsuite/gcc.dg/torture/pr112736.c | 27 ++++++++
> >  gcc/tree-vect-stmts.cc                  | 86 ++++++++++++++++++++-----
> >  2 files changed, 96 insertions(+), 17 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/torture/pr112736.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/torture/pr112736.c b/gcc/testsuite/gcc.dg/torture/pr112736.c
> > new file mode 100644
> > index 00000000000..6abb56edba3
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/torture/pr112736.c
> > @@ -0,0 +1,27 @@
> > +/* { dg-do run { target *-*-linux* *-*-gnu* *-*-uclinux* } } */
> > +
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +
> > +int a, c[3][5];
> > +
> > +void __attribute__((noipa))
> > +fn1 (int * __restrict b)
> > +{
> > +  int e;
> > +  for (a = 2; a >= 0; a--)
> > +    for (e = 0; e < 4; e++)
> > +      c[a][e] = b[a];
> > +}
> > +
> > +int main()
> > +{
> > +  long pgsz = sysconf (_SC_PAGESIZE);
> > +  void *p = mmap (NULL, pgsz * 2, PROT_READ|PROT_WRITE,
> > +                  MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
> > +  if (p == MAP_FAILED)
> > +    return 0;
> > +  mprotect (p, pgsz, PROT_NONE);
> > +  fn1 (p + pgsz);
> > +  return 0;
> > +}
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 390c8472fd6..c03c4c08c9d 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -11465,26 +11465,70 @@ vectorizable_load (vec_info *vinfo,
> >          if (new_vtype != NULL_TREE)
> >            ltype = half_vtype;
> >        }
> > +    /* Try to use a single smaller load when we are about
> > +       to load accesses excess elements compared to the
>
> s/accesses //
>
> > +       unrolled scalar loop.
> > +       ???  This should cover the above case as well.  */
> > +    else if (known_gt ((vec_num * j + i + 1) * nunits,
> > +                       (group_size * vf - gap)))
>
> At first it seemed odd to be using known_gt rather than maybe_gt here,
> given that peeling for gaps is a correctness issue.  But as things stand
> this is just an optimisation, and VLA vectors (to whatever extent they're
> handled by this code) allegedly work correctly without it.  So I agree
> known_gt is correct.
> We might need to revisit it when dealing with the ??? though.
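
(Aside, for readers following along: nunits and the compared bounds here
are poly_ints, so for VLA modes they depend on the runtime vector length.
The known_* predicates have to hold for every possible runtime value,
the maybe_* ones for at least one.  A toy model of that for
two-coefficient values c0 + c1 * N with N >= 0 -- an illustration only,
not the actual poly-int.h implementation:

  #include <stdbool.h>

  /* Toy two-coefficient poly value: c0 + c1 * N, where the runtime
     parameter N is an arbitrary non-negative integer.  */
  struct poly2 { long c0, c1; };

  /* known_gt: a > b has to hold for every N >= 0.  */
  static bool
  toy_known_gt (struct poly2 a, struct poly2 b)
  {
    return a.c0 > b.c0 && a.c1 >= b.c1;
  }

  /* maybe_gt: a > b holds for at least one N >= 0.  */
  static bool
  toy_maybe_gt (struct poly2 a, struct poly2 b)
  {
    return a.c0 > b.c0 || a.c1 > b.c1;
  }

With that reading known_gt only fires when the excess access happens for
every possible vector length, i.e. it errs on the side of not applying
the optimization.)
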
Yes, maybe_gt would be needed if it were for correctness, but then ...

> > +      {
> > +        if (known_ge ((vec_num * j + i + 1) * nunits
> > +                      - (group_size * vf - gap), nunits))
> > +          /* DR will be unused.  */
> > +          ltype = NULL_TREE;
> > +        else if (alignment_support_scheme == dr_aligned)
> > +          /* Aligned access to excess elements is OK if
> > +             at least one element is accessed in the
> > +             scalar loop.  */
> > +          ;
> > +        else
> > +          {
> > +            auto remain
> > +              = ((group_size * vf - gap)
> > +                 - (vec_num * j + i) * nunits);
> > +            /* remain should now be > 0 and < nunits.  */

... we probably don't know it's < nunits anymore.  Indeed, let's revisit
this at some later point.

> > +            unsigned num;
> > +            if (constant_multiple_p (nunits, remain, &num))
> > +              {
> > +                tree ptype;
> > +                new_vtype
> > +                  = vector_vector_composition_type (vectype,
> > +                                                    num,
> > +                                                    &ptype);
> > +                if (new_vtype)
> > +                  ltype = ptype;
> > +              }
> > +            /* Else use multiple loads or a masked load?  */
> > +          }
> > +      }
> >      tree offset
> >        = (dataref_offset ? dataref_offset
> >           : build_int_cst (ref_type, 0));
> > -    if (ltype != vectype
> > -        && memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> > +    if (!ltype)
> > +      ;
> > +    else if (ltype != vectype
> > +             && memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> >        {
> >          unsigned HOST_WIDE_INT gap_offset
> > -          = gap * tree_to_uhwi (TYPE_SIZE_UNIT (elem_type));
> > +          = (tree_to_uhwi (TYPE_SIZE_UNIT (vectype))
> > +             - tree_to_uhwi (TYPE_SIZE_UNIT (ltype)));
> >          tree gapcst = build_int_cst (ref_type, gap_offset);
> >          offset = size_binop (PLUS_EXPR, offset, gapcst);
>
> Can this trigger for the VLA case?  Probably not.  But it might be worth
> using tree_to_poly_uint64 instead of tree_to_uhwi, and auto instead of
> unsigned HOST_WIDE_INT, since the calculation is no longer based on scalars.

I'm not 100% sure it cannot, so fixed.

> >        }
> > -    data_ref
> > -      = fold_build2 (MEM_REF, ltype, dataref_ptr, offset);
> > -    if (alignment_support_scheme == dr_aligned)
> > -      ;
> > -    else
> > -      TREE_TYPE (data_ref)
> > -        = build_aligned_type (TREE_TYPE (data_ref),
> > -                              align * BITS_PER_UNIT);
> > -    if (ltype != vectype)
> > +    if (ltype)
> > +      {
> > +        data_ref
> > +          = fold_build2 (MEM_REF, ltype, dataref_ptr, offset);
> > +        if (alignment_support_scheme == dr_aligned)
> > +          ;
> > +        else
> > +          TREE_TYPE (data_ref)
> > +            = build_aligned_type (TREE_TYPE (data_ref),
> > +                                  align * BITS_PER_UNIT);
> > +      }
> > +    if (!ltype)
> > +      data_ref = build_constructor (vectype, NULL);
> > +    else if (ltype != vectype)
> >        {
> >          vect_copy_ref_info (data_ref,
> >                              DR_REF (first_dr_info->dr));
> > @@ -11494,18 +11538,26 @@ vectorizable_load (vec_info *vinfo,
> >                              gsi);
> >          data_ref = NULL;
> >          vec<constructor_elt, va_gc> *v;
> > -        vec_alloc (v, 2);
> > +        unsigned num
> > +          = (exact_div (tree_to_poly_uint64
> > +                          (TYPE_SIZE_UNIT (vectype)),
> > +                        tree_to_poly_uint64
> > +                          (TYPE_SIZE_UNIT (ltype)))
> > +             .to_constant ());
>
> FWIW, vector_unroll_factor probably fits here, but maybe that's a bit
> of a stretch.  The current version is OK with me too.

Yeah, we've computed this number above statically (two, for the "old"
correctness thing) or via the constant_multiple_p.  It was just least
awkward to recompute it here ...  I'll add a comment.

Richard.
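
(Another aside: what the new code arranges for the last vector, in a
hypothetical, simplified standalone form -- the names, the fixed V4SI
type and the element-wise, forward-only access are made up for
illustration; the real code instead does one smaller contiguous load of
ltype and builds a CONSTRUCTOR of num pieces, all but one of them zero:

  /* Only 'remain' scalar iterations are left, so read just those
     elements and zero-fill the rest instead of loading a full vector,
     which could run into an unmapped page as in the testcase above.  */
  typedef int v4si __attribute__ ((vector_size (16)));

  static v4si
  load_last_vector (const int *p, unsigned remain)
  {
    v4si v = { 0, 0, 0, 0 };
    for (unsigned k = 0; k < remain; ++k)
      v[k] = p[k];
    return v;
  }

For VMAT_CONTIGUOUS_REVERSE the zero pieces are appended before the
loaded piece instead of after it, which is what the two while (--num)
loops in the remaining hunk below do.)
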
> Thanks,
> Richard
>
> > +        vec_alloc (v, num);
> >          if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> >            {
> > -            CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> > -                                    build_zero_cst (ltype));
> > +            while (--num)
> > +              CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> > +                                      build_zero_cst (ltype));
> >              CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
> >            }
> >          else
> >            {
> >              CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
> > -            CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> > -                                    build_zero_cst (ltype));
> > +            while (--num)
> > +              CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> > +                                      build_zero_cst (ltype));
> >            }
> >          gcc_assert (new_vtype != NULL_TREE);
> >          if (new_vtype == vectype)

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)