Richard Biener <rguent...@suse.de> writes:
> Since we're now vectorizing by default at -O2, issues like PR101908
> become more important, where we apply basic-block vectorization to
> parts of the function covering loads from function parameters passed
> on the stack.  We have no good idea how the stack pushing was
> performed, but we do have a good idea that it was done recently, so
> at the start of the function an STLF failure is inevitable unless the
> argument-passing code used the same or larger vector stores, aligned
> the same way.
>
> Until there's a robust IPA-based solution, the following implements
> target-independent heuristics in the vectorizer to retain scalar
> loads for loads from parameters likely passed in memory (I use a
> BLKmode DECL_MODE check for this rather than firing up
> CUMULATIVE_ARGS).  I've also restricted this to loads from the first
> "block" (which can be less than the first basic block if there's a
> call, for example), since that covers the testcase.
>
> Note that for the testcase (but not c-ray from the bug report) there's
> an x86 peephole2 that vectorizes things back, so the patch is not
> effective there.
>
> Any comments?  I know we're also looking at x86 port-specific
> mitigations, but the issue will hit arm and power/z as well, I think.
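[To illustrate the scenario described above, here is a plain-C sketch;
the names are illustrative and the struct layout follows the testcase
in the patch.]

```c
/* Sketch of the STLF hazard described in the mail above.  The caller
   pushes the BLKmode struct argument onto the stack with scalar
   (8-byte) stores.  If BB SLP in the callee merges the two scalar
   field loads into one 16-byte vector load, that load overlaps
   several smaller in-flight stores, store-to-load forwarding fails,
   and the load stalls until the stores retire.  */

struct vec3 { double x, y, z; };
struct ray { struct vec3 orig, dir; };

void
ray_sphere (struct ray r, double *res)
{
  /* BB vectorization would turn these two scalar loads into a single
     vector load from the stack slot the caller has just written.  */
  res[0] = r.orig.y * r.dir.x;
  res[1] = r.orig.z * r.dir.y;
}
```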
I'm not sure this is a target-independent win.  In a loop that:

  stores 2 scalars
  loads the stored scalars as a vector
  adds a vector
  stores a vector

(no feedback), I see a 20% regression using elementwise accesses for
the load vs. using a normal vector load (measured on a Cortex-A72).
With feedback the elementwise version is still slower, but obviously
not by such a big factor.

I can see the argument for penalising this in the cost model in a
target-independent way, but I'm not sure we should change the
vectorisation strategy.

Thanks,
Richard

> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> Thanks,
> Richard.
>
> 2022-03-25  Richard Biener  <rguent...@suse.de>
>
> 	PR tree-optimization/101908
> 	* tree-vect-stmts.cc (get_group_load_store_type): Add
> 	heuristic to limit BB vectorization of function parameters.
>
> 	* gcc.dg/vect/bb-slp-pr101908.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c | 18 ++++++++++++++
>  gcc/tree-vect-stmts.cc                      | 27 ++++++++++++++++++++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> new file mode 100644
> index 00000000000..b7534a18f0e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target vect_double } */
> +
> +struct vec3 {
> +  double x, y, z;
> +};
> +
> +struct ray {
> +  struct vec3 orig, dir;
> +};
> +
> +void ray_sphere (struct ray ray, double *res)
> +{
> +  res[0] = ray.orig.y * ray.dir.x;
> +  res[1] = ray.orig.z * ray.dir.y;
> +}
> +
> +/* { dg-final { scan-tree-dump "STLF fail" "slp2" } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index f7449a79d1c..1e37e9678b6 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2197,7 +2197,32 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>    /* Stores can't yet have gaps.  */
>    gcc_assert (slp_node || vls_type == VLS_LOAD || gap == 0);
>
> -  if (slp_node)
> +  tree parm;
> +  if (!loop_vinfo
> +      && vls_type == VLS_LOAD
> +      /* The access is based on a PARM_DECL.  */
> +      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
> +      && ((parm = TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0)),
> +	  true)
> +      && TREE_CODE (parm) == PARM_DECL
> +      /* Likely passed on the stack.  */
> +      && DECL_MODE (parm) == BLKmode
> +      /* The access is in the first group.  */
> +      && first_dr_info->group == 0)
> +    {
> +      /* When doing BB vectorization, force early loads from function
> +	 parameters passed on the stack, and thus stored recently, to be
> +	 done elementwise to avoid store-to-load forwarding penalties.
> +	 Note this will cause vectorization to fail for the load because
> +	 of the fear of underestimating the cost of elementwise accesses,
> +	 see the end of get_load_store_type.  We then effectively do the
> +	 same by handling the loads as external input to the SLP.  */
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_NOTE, vect_location,
> +			 "Not using vector loads from function parameter "
> +			 "for fear of causing a STLF fail\n");
> +      *memory_access_type = VMAT_ELEMENTWISE;
> +    }
> +  else if (slp_node)
>      {
>        /* For SLP vectorization we directly vectorize a subchain
> 	 without permutation.  */
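[For reference, the loop shape measured in the reply above — two scalar
stores, a vector re-load of the stored pair, a vector add, and a vector
store — can be sketched as follows.  The function and array names are
illustrative, not taken from the benchmark.]

```c
/* Sketch of the measured loop: each iteration performs two scalar
   stores into tmp[], immediately re-loads the stored pair (which SLP
   can turn into a single vector load), adds, and stores a vector into
   out[].  Forcing the re-load to be elementwise here was reported
   ~20% slower on a Cortex-A72 than keeping the vector load.  */
void
kernel (double *restrict tmp, double *restrict out,
	const double *restrict in, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      tmp[i]     = in[i] * 2.0;		/* two scalar stores */
      tmp[i + 1] = in[i + 1] * 2.0;
      /* The freshly stored pair is re-loaded; vectorizable as one
	 vector load feeding a vector add and a vector store.  */
      out[i]     = tmp[i] + 1.0;
      out[i + 1] = tmp[i + 1] + 1.0;
    }
}
```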