Richard Biener <rguent...@suse.de> writes:
> Since we're now vectorizing by default at -O2, issues like PR101908
> become more important, where we apply basic-block vectorization to
> parts of the function covering loads from function parameters passed
> on the stack.  We have no good idea how the stack pushing was
> performed, but at the start of the function we do know it was done
> recently, so an STLF failure is inevitable unless the argument-passing
> code used the same or larger vector stores and also aligned them the
> same way.
>
> Until there's a robust IPA-based solution, the following implements
> target-independent heuristics in the vectorizer to retain scalar
> loads for loads from parameters likely passed in memory (I use
> a BLKmode DECL_MODE check for this rather than firing up
> CUMULATIVE_ARGS).  I've also restricted this to loads from the
> first "block" (which can be less than the first basic block if there's
> a call, for example), since that covers the testcase.
>
> Note that for the testcase (but not for c-ray from the bug report)
> there's an x86 peephole2 that vectorizes things back, so the patch is
> not effective there.
>
> Any comments?  I know we're also looking at x86 port-specific
> mitigations, but I think the issue will hit arm and power/z as well.

I'm not sure this is a target-independent win.  In a loop that:

  stores 2 scalars
  loads the stored scalars as a vector
  adds a vector
  stores a vector

(no feedback), I see a 20% regression using elementwise accesses for
the load vs. using a normal vector load (measured on a Cortex-A72).
With feedback the elementwise version is still slower, but obviously
not by such a big factor.

I can see the argument for penalising this in the cost model in
a target-independent way, but I'm not sure we should change the
vectorisation strategy.

Thanks,
Richard

> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> Thanks,
> Richard.
>
> 2022-03-25  Richard Biener  <rguent...@suse.de>
>
>       PR tree-optimization/101908
>       * tree-vect-stmts.cc (get_group_load_store_type): Add
>       heuristic to limit BB vectorization of function parameters.
>
>       * gcc.dg/vect/bb-slp-pr101908.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c | 18 ++++++++++++++
>  gcc/tree-vect-stmts.cc                      | 27 ++++++++++++++++++++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> new file mode 100644
> index 00000000000..b7534a18f0e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target vect_double } */
> +
> +struct vec3 {
> +  double x, y, z;
> +};
> +
> +struct ray {
> +  struct vec3 orig, dir;
> +};
> +
> +void ray_sphere (struct ray ray, double *res)
> +{
> +  res[0] = ray.orig.y * ray.dir.x;
> +  res[1] = ray.orig.z * ray.dir.y;
> +}
> +
> +/* { dg-final { scan-tree-dump "STLF fail" "slp2" } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index f7449a79d1c..1e37e9678b6 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2197,7 +2197,32 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>    /* Stores can't yet have gaps.  */
>    gcc_assert (slp_node || vls_type == VLS_LOAD || gap == 0);
>  
> -  if (slp_node)
> +  tree parm;
> +  if (!loop_vinfo
> +      && vls_type == VLS_LOAD
> +      /* The access is based on a PARM_DECL.  */
> +      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
> +      && ((parm = TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0)), true)
> +      && TREE_CODE (parm) == PARM_DECL
> +      /* Likely passed on the stack.  */
> +      && DECL_MODE (parm) == BLKmode
> +      /* The access is in the first group.  */
> +      && first_dr_info->group == 0)
> +    {
> +      /* When doing BB vectorizing force early loads from function parameters
> +      passed on the stack and thus stored recently to be done elementwise
> +      to avoid store-to-load forwarding penalties.
> +      Note this will cause vectorization to fail for the load because of
> +      the fear of underestimating the cost of elementwise accesses,
> +      see the end of get_load_store_type.  We are then going to effectively
> +      do the same via handling the loads as external input to the SLP.  */
> +      if (dump_enabled_p ())
> +     dump_printf_loc (MSG_NOTE, vect_location,
> +                      "Not using vector loads from function parameter "
> +                      "for the fear of causing a STLF fail\n");
> +      *memory_access_type = VMAT_ELEMENTWISE;
> +    }
> +  else if (slp_node)
>      {
>        /* For SLP vectorization we directly vectorize a subchain
>        without permutation.  */
