On Wed, Jul 23, 2025 at 1:51 PM Andrew Stubbs <a...@baylibre.com> wrote:
>
> From: Julian Brown <jul...@codesourcery.com>
>
> This patch was originally written by Julian in 2021 for the OG10 branch,
> but does not appear to have been proposed for upstream at that time, or
> since.  I've now forward-ported and retested it.  Thomas reported
> test regressions with this patch on the OG14 branch, but I think it was
> exposing some bugs in the backend; I can't reproduce those failures on
> mainline.
>
> I'm not sure what the original motivating test case was, but I see that
> the assembler output for gfortran.dg/vect/fast-math-pr37021.f90 is
> reduced from ~24k lines to under 7k on amdgcn.
>
> OK for mainline?

I do wonder whether the single_element_p check isn't there for
correctness?  And how does the patch make a difference when we still
require SLP_TREE_LANES (slp_node) == 1?

Richard.

> Andrew
>
> ------------
>
> For AMD GCN, the instructions available for loading/storing vectors are
> always scatter/gather operations (i.e. there are separate addresses for
> each vector lane), so the heuristic in get_group_load_store_type that
> avoids gather/scatter operations with too many elements is
> counterproductive on this target.  Avoiding those operations there can
> subsequently lead to a missed vectorization opportunity: later analyses
> in the vectorizer try to use a very wide array type which is not
> available on this target, and so vectorization fails.
>
> This patch adds a target hook to override the "single_element_p"
> heuristic in that function, and activates the hook for GCN.  This
> allows much better code to be generated for affected loops.
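>
> Enabling the hook is a two-line definition in the backend; the gcn.cc
> hunk below does exactly this, while all other targets keep the default
> of false:
>
>   #undef  TARGET_VECTORIZE_PREFER_GATHER_SCATTER
>   #define TARGET_VECTORIZE_PREFER_GATHER_SCATTER true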
>
> Co-authored-by: Julian Brown <jul...@codesourcery.com>
>
> gcc/
>         * doc/tm.texi.in (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Add
>         hook documentation.
>         * doc/tm.texi: Regenerate.
>         * target.def (prefer_gather_scatter): Add target hook in the
>         vectorizer section.
>         * tree-vect-stmts.cc (get_group_load_store_type): Optionally prefer
>         gather/scatter instructions to scalar/elementwise fallback.
>         * config/gcn/gcn.cc (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Define
>         hook.
> ---
>  gcc/config/gcn/gcn.cc  | 2 ++
>  gcc/doc/tm.texi        | 5 +++++
>  gcc/doc/tm.texi.in     | 2 ++
>  gcc/target.def         | 8 ++++++++
>  gcc/tree-vect-stmts.cc | 2 +-
>  5 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
> index 3b26d5c6a58..d451bf43355 100644
> --- a/gcc/config/gcn/gcn.cc
> +++ b/gcc/config/gcn/gcn.cc
> @@ -7998,6 +7998,8 @@ gcn_dwarf_register_span (rtx rtl)
>    gcn_vector_alignment_reachable
>  #undef  TARGET_VECTOR_MODE_SUPPORTED_P
>  #define TARGET_VECTOR_MODE_SUPPORTED_P gcn_vector_mode_supported_p
> +#undef  TARGET_VECTORIZE_PREFER_GATHER_SCATTER
> +#define TARGET_VECTORIZE_PREFER_GATHER_SCATTER true
>
>  #undef TARGET_DOCUMENTATION_NAME
>  #define TARGET_DOCUMENTATION_NAME "AMD GCN"
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 5e305643b3a..29177d81466 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6511,6 +6511,11 @@ The default is @code{NULL_TREE} which means to not vectorize scatter
>  stores.
>  @end deftypefn
>
> +@deftypevr {Target Hook} bool TARGET_VECTORIZE_PREFER_GATHER_SCATTER
> +This hook is set to @code{true} if gather loads or scatter stores are
> +cheaper on this target than a sequence of elementwise loads or stores.
> +@end deftypevr
> +
>  @deftypefn {Target Hook} int TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN (struct cgraph_node *@var{}, struct cgraph_simd_clone *@var{}, @var{tree}, @var{int}, @var{bool})
>  This hook should set @var{vecsize_mangle}, @var{vecsize_int}, @var{vecsize_float}
>  fields in @var{simd_clone} structure pointed by @var{clone_info} argument and also
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index eccc4d88493..b03ad4c97c6 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4311,6 +4311,8 @@ address;  but often a machine-dependent strategy can generate better code.
>
>  @hook TARGET_VECTORIZE_BUILTIN_SCATTER
>
> +@hook TARGET_VECTORIZE_PREFER_GATHER_SCATTER
> +
>  @hook TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN
>
>  @hook TARGET_SIMD_CLONE_ADJUST
> diff --git a/gcc/target.def b/gcc/target.def
> index 38903eb567a..dd57b7072af 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -2056,6 +2056,14 @@ all zeros.  GCC can then try to branch around the instruction instead.
>   (unsigned ifn),
>   default_empty_mask_is_expensive)
>
> +/* Prefer gather/scatter loads/stores to e.g. elementwise accesses if
> +   we cannot use a contiguous access.  */
> +DEFHOOKPOD
> +(prefer_gather_scatter,
> + "This hook is set to @code{true} if gather loads or scatter stores are\n\
> +cheaper on this target than a sequence of elementwise loads or stores.",
> + bool, false)
> +
>  /* Target builtin that implements vector gather operation.  */
>  DEFHOOK
>  (builtin_gather,
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 2e9b3d2e686..8ca33f5951a 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2349,7 +2349,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>       allows us to use contiguous accesses.  */
>    if ((*memory_access_type == VMAT_ELEMENTWISE
>         || *memory_access_type == VMAT_STRIDED_SLP)
> -      && single_element_p
> +      && (targetm.vectorize.prefer_gather_scatter || single_element_p)
>        && SLP_TREE_LANES (slp_node) == 1
>        && loop_vinfo
>        && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
> --
> 2.50.0
>
