On Wed, Jul 23, 2025 at 1:51 PM Andrew Stubbs <a...@baylibre.com> wrote:
>
> From: Julian Brown <jul...@codesourcery.com>
>
> This patch was originally written by Julian in 2021 for the OG10 branch,
> but does not appear to have been proposed for upstream at that time, or
> since.  I've now forward ported it and retested it.  Thomas reported
> test regressions with this patch on the OG14 branch, but I think it was
> exposing some bugs in the backend; I can't reproduce those failures on
> mainline.
>
> I'm not sure what the original motivating test case was, but I see that
> the gfortran.dg/vect/fast-math-pr37021.f90 testcase is reduced from ~24k
> lines of assembler down to <7k, on amdgcn.
>
> OK for mainline?
I do wonder if the single_element_p check isn't for correctness?  And how
does the patch make a difference when we still require
SLP_TREE_LANES (slp_node) == 1?

Richard.

> Andrew
>
> ------------
>
> For AMD GCN, the instructions available for loading/storing vectors are
> always scatter/gather operations (i.e. there are separate addresses for
> each vector lane), so the current heuristic to avoid gather/scatter
> operations with too many elements in get_group_load_store_type is
> counterproductive.  Avoiding such operations in that function can
> subsequently lead to a missed vectorization opportunity whereby later
> analyses in the vectorizer try to use a very wide array type which is
> not available on this target, and thus it bails out.
>
> This patch adds a target hook to override the "single_element_p"
> heuristic in that function, and activates the hook for GCN.  This
> allows much better code to be generated for affected loops.
>
> Co-authored-by: Julian Brown <jul...@codesourcery.com>
>
> gcc/
> 	* doc/tm.texi.in (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Add
> 	documentation hook.
> 	* doc/tm.texi: Regenerate.
> 	* target.def (prefer_gather_scatter): Add target hook under
> 	vectorizer.
> 	* tree-vect-stmts.cc (get_group_load_store_type): Optionally prefer
> 	gather/scatter instructions to scalar/elementwise fallback.
> 	* config/gcn/gcn.cc (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Define
> 	hook.
> ---
>  gcc/config/gcn/gcn.cc  | 2 ++
>  gcc/doc/tm.texi        | 5 +++++
>  gcc/doc/tm.texi.in     | 2 ++
>  gcc/target.def         | 8 ++++++++
>  gcc/tree-vect-stmts.cc | 2 +-
>  5 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
> index 3b26d5c6a58..d451bf43355 100644
> --- a/gcc/config/gcn/gcn.cc
> +++ b/gcc/config/gcn/gcn.cc
> @@ -7998,6 +7998,8 @@ gcn_dwarf_register_span (rtx rtl)
>    gcn_vector_alignment_reachable
>  #undef TARGET_VECTOR_MODE_SUPPORTED_P
>  #define TARGET_VECTOR_MODE_SUPPORTED_P gcn_vector_mode_supported_p
> +#undef TARGET_VECTORIZE_PREFER_GATHER_SCATTER
> +#define TARGET_VECTORIZE_PREFER_GATHER_SCATTER true
>
>  #undef TARGET_DOCUMENTATION_NAME
>  #define TARGET_DOCUMENTATION_NAME "AMD GCN"
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 5e305643b3a..29177d81466 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6511,6 +6511,11 @@ The default is @code{NULL_TREE} which means to not vectorize scatter
>  stores.
>  @end deftypefn
>
> +@deftypevr {Target Hook} bool TARGET_VECTORIZE_PREFER_GATHER_SCATTER
> +This hook is set to TRUE if gather loads or scatter stores are cheaper on
> +this target than a sequence of elementwise loads or stores.
> +@end deftypevr
> +
>  @deftypefn {Target Hook} int TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN (struct cgraph_node *@var{}, struct cgraph_simd_clone *@var{}, @var{tree}, @var{int}, @var{bool})
>  This hook should set @var{vecsize_mangle}, @var{vecsize_int}, @var{vecsize_float}
>  fields in @var{simd_clone} structure pointed by @var{clone_info} argument and also
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index eccc4d88493..b03ad4c97c6 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4311,6 +4311,8 @@ address; but often a machine-dependent strategy can generate better code.
>
>  @hook TARGET_VECTORIZE_BUILTIN_SCATTER
>
> +@hook TARGET_VECTORIZE_PREFER_GATHER_SCATTER
> +
>  @hook TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN
>
>  @hook TARGET_SIMD_CLONE_ADJUST
> diff --git a/gcc/target.def b/gcc/target.def
> index 38903eb567a..dd57b7072af 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -2056,6 +2056,14 @@ all zeros.  GCC can then try to branch around the instruction instead.",
>   (unsigned ifn),
>   default_empty_mask_is_expensive)
>
> +/* Prefer gather/scatter loads/stores to e.g. elementwise accesses if\n\
> +we cannot use a contiguous access.  */
> +DEFHOOKPOD
> +(prefer_gather_scatter,
> + "This hook is set to TRUE if gather loads or scatter stores are cheaper on\n\
> +this target than a sequence of elementwise loads or stores.",
> + bool, false)
> +
>  /* Target builtin that implements vector gather operation.  */
>  DEFHOOK
>  (builtin_gather,
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 2e9b3d2e686..8ca33f5951a 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2349,7 +2349,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>       allows us to use contiguous accesses.  */
>    if ((*memory_access_type == VMAT_ELEMENTWISE
>         || *memory_access_type == VMAT_STRIDED_SLP)
> -      && single_element_p
> +      && (targetm.vectorize.prefer_gather_scatter || single_element_p)
>        && SLP_TREE_LANES (slp_node) == 1
>        && loop_vinfo
>        && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
> --
> 2.50.0
>
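For anyone reading along who wants a concrete picture of the affected pattern: the loop below is a hypothetical sketch (it is not the motivating testcase, which Andrew says is unknown).  A strided read like this is the kind of access get_group_load_store_type classifies as VMAT_ELEMENTWISE or VMAT_STRIDED_SLP; with the hook enabled, a target whose memory instructions are inherently gather/scatter (like GCN) can vectorize it with a gather load rather than bailing out to elementwise scalar loads.

```c
#include <stddef.h>

/* Illustrative only: each iteration reads a different, non-contiguous
   address (a[i * stride]), so a vector load needs one address per lane,
   i.e. a gather.  Without the hook the vectorizer may instead emit a
   long sequence of elementwise loads, or give up.  */
float
sum_strided (const float *a, size_t n, size_t stride)
{
  float sum = 0.0f;
  for (size_t i = 0; i < n; i++)
    sum += a[i * stride];
  return sum;
}
```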