On Wed, Jul 23, 2025 at 3:24 PM Andrew Stubbs <a...@baylibre.com> wrote:
>
> On 23/07/2025 13:24, Richard Biener wrote:
> > On Wed, Jul 23, 2025 at 1:51 PM Andrew Stubbs <a...@baylibre.com> wrote:
> >>
> >> From: Julian Brown <jul...@codesourcery.com>
> >>
> >> This patch was originally written by Julian in 2021 for the OG10 branch,
> >> but does not appear to have been proposed for upstream at that time, or
> >> since.  I've now forward ported it and retested it.  Thomas reported
> >> test regressions with this patch on the OG14 branch, but I think it was
> >> exposing some bugs in the backend; I can't reproduce those failures on
> >> mainline.
> >>
> >> I'm not sure what the original motivating test case was, but I see that
> >> the gfortran.dg/vect/fast-math-pr37021.f90 testcase is reduced from ~24k
> >> lines of assembler down to <7k, on amdgcn.
> >>
> >> OK for mainline?
> >
> > I do wonder if the single_element_p check isn't for correctness?  And how
> > the patch makes a difference when we still require SLP_TREE_LANES
> > (slp_node) == 1?
>
> The SLP_TREE_LANES thing was already true in the case where
> single_element_p was set, so I had assumed that was acceptable.  I don't
> know enough about this stuff to completely understand what Julian has
> done here (and I don't suppose there's much point in asking him now,
> after all this time).
>
> Without this patch, the testcase has this in the "vect" dump:
Which testcase?  It looks like we have a two-element group and a variable
step, and the elements are used even/odd.  But we don't lower that yet
(yeah, incomplete work...) and thus have a load permutation still?

> _1417 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1415];
> ivtmp_1418 = ivtmp_1415 + _1147;
> _1419 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1418];
> ivtmp_1420 = ivtmp_1418 + _1147;
> _1421 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1420];
> ivtmp_1422 = ivtmp_1420 + _1147;
> _1423 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1422];
> ivtmp_1424 = ivtmp_1422 + _1147;
> _1425 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1424];
> ivtmp_1426 = ivtmp_1424 + _1147;
> _1427 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1426];
>   :
>   :
> [repeats 32 times]
>   :
>   :
> vect_cst__1481 = {_1417, _1419, _1421, _1423, _1425, _1427, ......
> _1482 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1480];
> ivtmp_1483 = ivtmp_1480 + _1147;
> _1484 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1483];
> ivtmp_1485 = ivtmp_1483 + _1147;
> _1486 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1485];
> ivtmp_1487 = ivtmp_1485 + _1147;
> _1488 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1487];
> ivtmp_1489 = ivtmp_1487 + _1147;
>   :
>   :
> [repeats 32 times]
>   :
>   :
> vect_cst__1545 = {_1482, _1484, _1486, _1488, _1490, _1492, ......
> vect__42.84_1546 = VEC_PERM_EXPR <vect_cst__1481, vect_cst__1545,
> { 2, 4, 6, 8, 10, 12, 14, 16, .....
>
> And with the patch:
>
> _472 = stride.3_51 * 16;
> _473 = (sizetype) _472;
> _474 = VEC_SERIES_EXPR <0, _473>;
> vect__42.66_504 = .MASK_GATHER_LOAD (vectp.64_501, _474, 1, { 0.0, ...
>
> The pattern occurs a few times in this testcase.  (Although there remain
> a number of cases where the vectorizer falls back to elementwise
> BIT_FIELD_REF that I would prefer were done in some vectorized way.)

So in the case where both elements of the group are used, and thus there is
only one vector and no permute, you'd still get the vector(2) loads and the
CTOR.  I suppose you still want to use a gather here, but then we either
need to use an element type as large as vector(2) real(kind=8), to be able
to use the same simple VEC_SERIES_EXPR, or generate VEC_SERIES_EXPR <0, _473>
and VEC_SERIES_EXPR <1, _473> and interleave those into an offset vector that
goes { 0, 1, _473, _473 + 1, 2*_473, 2*_473 + 1, ... } for the case of
SLP_TREE_LANES != 1 (or rather group_size != 1).

That said, the hook is a bit black/white: whether the target prefers a
gather/scatter over N piecewise operations with equal stride depends at
least on the vector mode.  On x86_64, for V2DImode we definitely do not
want a gather; for V16SFmode it probably depends (a V64QImode gather isn't
supported).
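To illustrate the interleaving idea for the two-lane group case, in dump
form (just a sketch, not output from any existing code; the SSA names, the
assumed 64-lane vectors and the permute selector are made up): the permute
interleaves the low halves of the two series, so the offsets become
{ 0, 1, _473, _473 + 1, ... } and can feed the gather directly:

  _offA = VEC_SERIES_EXPR <0, _473>;
  _offB = VEC_SERIES_EXPR <1, _473>;
  _off = VEC_PERM_EXPR <_offA, _offB, { 0, 64, 1, 65, 2, 66, ... }>;
  vect__x = .MASK_GATHER_LOAD (vectp.64_501, _off, 1, { 0.0, ... }, ...);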
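As for the hook being black/white: rather than a plain DEFHOOKPOD, I'd
imagine a variant that at least takes the vector mode, so a target can
answer per mode instead of once for all.  Roughly something like this
(only a sketch, not part of the patch; hook_bool_mode_false is meant to be
the generic always-false default from hooks.h):

DEFHOOK
(prefer_gather_scatter,
 "Return true if gather loads or scatter stores are cheaper on this target\n\
than a sequence of elementwise loads or stores for vectors of mode\n\
@var{mode}.",
 bool, (machine_mode mode),
 hook_bool_mode_false)

GCN could then simply return true unconditionally, while x86_64 could say
no for V2DImode and decide the wider modes case by case.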

Richard.

> >
> > Richard.
> >
> >> Andrew
> >>
> >> ------------
> >>
> >> For AMD GCN, the instructions available for loading/storing vectors are
> >> always scatter/gather operations (i.e. there are separate addresses for
> >> each vector lane), so the current heuristic to avoid gather/scatter
> >> operations with too many elements in get_group_load_store_type is
> >> counterproductive.  Avoiding such operations in that function can
> >> subsequently lead to a missed vectorization opportunity whereby later
> >> analyses in the vectorizer try to use a very wide array type which is
> >> not available on this target, and thus it bails out.
> >>
> >> This patch adds a target hook to override the "single_element_p"
> >> heuristic in that function, and activates the hook for GCN.  This allows
> >> much better code to be generated for affected loops.
> >>
> >> Co-authored-by: Julian Brown <jul...@codesourcery.com>
> >>
> >> gcc/
> >>         * doc/tm.texi.in (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Add
> >>         documentation hook.
> >>         * doc/tm.texi: Regenerate.
> >>         * target.def (prefer_gather_scatter): Add target hook under
> >>         vectorizer.
> >>         * tree-vect-stmts.cc (get_group_load_store_type): Optionally
> >>         prefer gather/scatter instructions to scalar/elementwise fallback.
> >>         * config/gcn/gcn.cc (TARGET_VECTORIZE_PREFER_GATHER_SCATTER):
> >>         Define hook.
> >> ---
> >>  gcc/config/gcn/gcn.cc  | 2 ++
> >>  gcc/doc/tm.texi        | 5 +++++
> >>  gcc/doc/tm.texi.in     | 2 ++
> >>  gcc/target.def         | 8 ++++++++
> >>  gcc/tree-vect-stmts.cc | 2 +-
> >>  5 files changed, 18 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
> >> index 3b26d5c6a58..d451bf43355 100644
> >> --- a/gcc/config/gcn/gcn.cc
> >> +++ b/gcc/config/gcn/gcn.cc
> >> @@ -7998,6 +7998,8 @@ gcn_dwarf_register_span (rtx rtl)
> >>     gcn_vector_alignment_reachable
> >>  #undef TARGET_VECTOR_MODE_SUPPORTED_P
> >>  #define TARGET_VECTOR_MODE_SUPPORTED_P gcn_vector_mode_supported_p
> >> +#undef TARGET_VECTORIZE_PREFER_GATHER_SCATTER
> >> +#define TARGET_VECTORIZE_PREFER_GATHER_SCATTER true
> >>
> >>  #undef TARGET_DOCUMENTATION_NAME
> >>  #define TARGET_DOCUMENTATION_NAME "AMD GCN"
> >> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> >> index 5e305643b3a..29177d81466 100644
> >> --- a/gcc/doc/tm.texi
> >> +++ b/gcc/doc/tm.texi
> >> @@ -6511,6 +6511,11 @@ The default is @code{NULL_TREE} which means to not vectorize scatter
> >>  stores.
> >>  @end deftypefn
> >>
> >> +@deftypevr {Target Hook} bool TARGET_VECTORIZE_PREFER_GATHER_SCATTER
> >> +This hook is set to TRUE if gather loads or scatter stores are cheaper on
> >> +this target than a sequence of elementwise loads or stores.
> >> +@end deftypevr
> >> +
> >>  @deftypefn {Target Hook} int TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN (struct cgraph_node *@var{}, struct cgraph_simd_clone *@var{}, @var{tree}, @var{int}, @var{bool})
> >>  This hook should set @var{vecsize_mangle}, @var{vecsize_int}, @var{vecsize_float}
> >>  fields in @var{simd_clone} structure pointed by @var{clone_info} argument and also
> >> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> >> index eccc4d88493..b03ad4c97c6 100644
> >> --- a/gcc/doc/tm.texi.in
> >> +++ b/gcc/doc/tm.texi.in
> >> @@ -4311,6 +4311,8 @@ address; but often a machine-dependent strategy can generate better code.
> >>
> >>  @hook TARGET_VECTORIZE_BUILTIN_SCATTER
> >>
> >> +@hook TARGET_VECTORIZE_PREFER_GATHER_SCATTER
> >> +
> >>  @hook TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN
> >>
> >>  @hook TARGET_SIMD_CLONE_ADJUST
> >> diff --git a/gcc/target.def b/gcc/target.def
> >> index 38903eb567a..dd57b7072af 100644
> >> --- a/gcc/target.def
> >> +++ b/gcc/target.def
> >> @@ -2056,6 +2056,14 @@ all zeros.  GCC can then try to branch around the instruction instead.",
> >>   (unsigned ifn),
> >>   default_empty_mask_is_expensive)
> >>
> >> +/* Prefer gather/scatter loads/stores to e.g. elementwise accesses if\n\
> >> +we cannot use a contiguous access.  */
> >> +DEFHOOKPOD
> >> +(prefer_gather_scatter,
> >> + "This hook is set to TRUE if gather loads or scatter stores are cheaper on\n\
> >> +this target than a sequence of elementwise loads or stores.",
> >> + bool, false)
> >> +
> >>  /* Target builtin that implements vector gather operation.  */
> >>  DEFHOOK
> >>  (builtin_gather,
> >> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> >> index 2e9b3d2e686..8ca33f5951a 100644
> >> --- a/gcc/tree-vect-stmts.cc
> >> +++ b/gcc/tree-vect-stmts.cc
> >> @@ -2349,7 +2349,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> >>       allows us to use contiguous accesses.  */
> >>    if ((*memory_access_type == VMAT_ELEMENTWISE
> >>         || *memory_access_type == VMAT_STRIDED_SLP)
> >> -      && single_element_p
> >> +      && (targetm.vectorize.prefer_gather_scatter || single_element_p)
> >>        && SLP_TREE_LANES (slp_node) == 1
> >>        && loop_vinfo
> >>        && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
> >> --
> >> 2.50.0
> >>
>