> Am 23.07.2025 um 16:09 schrieb Andrew Stubbs <a...@baylibre.com>:
>
> On 23/07/2025 14:52, Richard Biener wrote:
>>> On Wed, Jul 23, 2025 at 3:24 PM Andrew Stubbs <a...@baylibre.com> wrote:
>>>
>>> On 23/07/2025 13:24, Richard Biener wrote:
>>>> On Wed, Jul 23, 2025 at 1:51 PM Andrew Stubbs <a...@baylibre.com> wrote:
>>>>>
>>>>> From: Julian Brown <jul...@codesourcery.com>
>>>>>
>>>>> This patch was originally written by Julian in 2021 for the OG10 branch,
>>>>> but does not appear to have been proposed for upstream at that time, or
>>>>> since. I've now forward-ported it and retested it. Thomas reported
>>>>> test regressions with this patch on the OG14 branch, but I think it was
>>>>> exposing some bugs in the backend; I can't reproduce those failures on
>>>>> mainline.
>>>>>
>>>>> I'm not sure what the original motivating test case was, but I see that
>>>>> the gfortran.dg/vect/fast-math-pr37021.f90 testcase is reduced from ~24k
>>>>> lines of assembler down to <7k, on amdgcn.
>>>>>
>>>>> OK for mainline?
>>>>
>>>> I do wonder whether the single_element_p check isn't there for correctness? And
>>>> how does the patch make a difference when we still require SLP_TREE_LANES
>>>> (slp_node) == 1?
>>>
>>> The SLP_TREE_LANES thing was already true in the case where
>>> single_element_p was set, so I had assumed that was acceptable. I don't
>>> know enough about this stuff to completely understand what Julian has
>>> done here (and I don't suppose there's much point in asking him now,
>>> after all this time).
>>>
>>> Without this patch, the testcase has this in the "vect" dump:
>> Which testcase? It looks like we have a two-element group and
>> a variable step, and the elements are used even/odd. But we don't
>> lower that yet (yeah, incomplete work...) and thus still have a load
>> permutation?
>
> Sorry, it was in my original message, but I should have repeated it.
>
> This is gfortran.dg/vect/fast-math-pr37021.f90. It's using vectors of complex
> numbers, hence the even/odd.
So you end up with two gathers, one for the even elements and one for the odd?
Would it be better to do a single combined load and split even/odd after the load?
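At the GIMPLE level that would be roughly the following (just a sketch; the
SSA names, the permute masks and the mask/else operands of the gather are
made up):

  /* One gather covering both lanes, then an even/odd split.  */
  _all  = .MASK_GATHER_LOAD (vectp, _offs, 1, { 0.0, ... }, mask);
  _even = VEC_PERM_EXPR <_all, _all, { 0, 2, 4, ... }>;
  _odd  = VEC_PERM_EXPR <_all, _all, { 1, 3, 5, ... }>;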
>>> _1417 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1415];
>>> ivtmp_1418 = ivtmp_1415 + _1147;
>>> _1419 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1418];
>>> ivtmp_1420 = ivtmp_1418 + _1147;
>>> _1421 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1420];
>>> ivtmp_1422 = ivtmp_1420 + _1147;
>>> _1423 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1422];
>>> ivtmp_1424 = ivtmp_1422 + _1147;
>>> _1425 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1424];
>>> ivtmp_1426 = ivtmp_1424 + _1147;
>>> _1427 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1426];
>>> :
>>> :
>>> [repeats 32 times]
>>> :
>>> :
>>> vect_cst__1481 = {_1417, _1419, _1421, _1423, _1425, _1427, ......
>>> _1482 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1480];
>>> ivtmp_1483 = ivtmp_1480 + _1147;
>>> _1484 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1483];
>>> ivtmp_1485 = ivtmp_1483 + _1147;
>>> _1486 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1485];
>>> ivtmp_1487 = ivtmp_1485 + _1147;
>>> _1488 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1487];
>>> ivtmp_1489 = ivtmp_1487 + _1147;
>>> :
>>> :
>>> [repeats 32 times]
>>> :
>>> :
>>> vect_cst__1545 = {_1482, _1484, _1486, _1488, _1490, _1492, ......
>>> vect__42.84_1546 = VEC_PERM_EXPR <vect_cst__1481, vect_cst__1545,
>>> { 2, 4, 6, 8, 10, 12, 14, 16, .....
>>>
>>> And with the patch:
>>>
>>> _472 = stride.3_51 * 16;
>>> _473 = (sizetype) _472;
>>> _474 = VEC_SERIES_EXPR <0, _473>;
>>> vect__42.66_504 = .MASK_GATHER_LOAD (vectp.64_501, _474, 1, { 0.0, ...
>>>
>>> The pattern occurs a few times in this testcase. (Although there remain
>>> a number of cases where the vectorizer falls back to elementwise
>>> BIT_FIELD_REF accesses that I would prefer were done in some vectorized way.)
>> So in case you had both elements from the group used, and thus only one
>> vector and no permute, you'd still get the vector(2) loads and the CTOR.
>> I suppose you still want to use gather here, but then we either need to
>> use an element type as large as vector(2) real(kind=8) to be able to use
>> the same simple VEC_SERIES_EXPR or generate
>> VEC_SERIES_EXPR <0, _473> and VEC_SERIES_EXPR <1, _473>
>> and interleave those into an offset vector that goes { 0, 1, _473,
>> _473 + 1, 2*_473, 2*_473 + 1, ... }
>> for the case of SLP_TREE_LANES != 1 (or rather group_size != 1).
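Concretely, that interleaved-offset variant might look something like this at
the GIMPLE level (a sketch only, reusing _473 from the dump above; the permute
mask encoding and any byte-offset scaling are glossed over):

  _even = VEC_SERIES_EXPR <0, _473>;   /* { 0, _473, 2*_473, ... }  */
  _odd  = VEC_SERIES_EXPR <1, _473>;   /* { 1, _473 + 1, 2*_473 + 1, ... }  */
  _offs = VEC_PERM_EXPR <_even, _odd, { 0, N, 1, N+1, 2, N+2, ... }>;
  vect_x = .MASK_GATHER_LOAD (vectp, _offs, 1, { 0.0, ... }, mask);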
>
> If you can generate a vector of either addresses or offsets without going
> elementwise then we want to use a gather. Even building the offsets elementwise
> is better than eating 64 separate sequential memory latencies when you can
> parallelize that into a single instruction.
>
>> That said, the hook is a bit black/white - whether the target prefers
>> a gather/scatter over N piecewise operations with equal stride depends
>> at least on the vector mode. On x86_64 for V2DImode definitely no
>> gather, for V16SFmode it probably depends (V64QImode gather isn't
>> supported).
>
> I expected this one might come up. What parameters would you like to see in the
> hook? Should it be the mode, or the vectype? I see there's also masked_p, which
> might be relevant to some architectures?
The mode, the (possibly constant) stride, and the group size.
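Something along these lines, perhaps, turning the DEFHOOKPOD into a DEFHOOK
(purely a sketch: the parameter list, the tree-typed stride and the NULL
default are my guesses, none of it is decided):

DEFHOOK
(prefer_gather_scatter,
 "Return true if gather loads or scatter stores are cheaper on this target\n\
than equivalent elementwise accesses for vector mode @var{mode}, given the\n\
(possibly non-constant) @var{stride} and the @var{group_size}.",
 bool, (machine_mode mode, tree stride, unsigned int group_size),
 NULL)

Passing the stride as a tree keeps the non-constant case expressible; a
constant plus a "variable stride" flag would work just as well.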
> If the patch (and concept) is otherwise OK, I can get that done.
>
> Andrew
>
>> Richard.
>>>
>>>> Richard.
>>>>
>>>>> Andrew
>>>>>
>>>>> ------------
>>>>>
>>>>> For AMD GCN, the instructions available for loading/storing vectors are
>>>>> always scatter/gather operations (i.e. there are separate addresses for
>>>>> each vector lane), so the current heuristic to avoid gather/scatter
>>>>> operations with too many elements in get_group_load_store_type is
>>>>> counterproductive. Avoiding such operations in that function can
>>>>> subsequently lead to a missed vectorization opportunity whereby later
>>>>> analyses in the vectorizer try to use a very wide array type which is
>>>>> not available on this target, so the vectorizer bails out.
>>>>>
>>>>> This patch adds a target hook to override the "single_element_p"
>>>>> heuristic in that function, and activates it for GCN. This allows much
>>>>> better code to be generated for affected loops.
>>>>>
>>>>> Co-authored-by: Julian Brown <jul...@codesourcery.com>
>>>>>
>>>>> gcc/
>>>>> * doc/tm.texi.in (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Add
>>>>> documentation hook.
>>>>> * doc/tm.texi: Regenerate.
>>>>> * target.def (prefer_gather_scatter): Add target hook under
>>>>> vectorizer.
>>>>> * tree-vect-stmts.cc (get_group_load_store_type): Optionally prefer
>>>>> gather/scatter instructions to scalar/elementwise fallback.
>>>>> * config/gcn/gcn.cc (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Define
>>>>> hook.
>>>>> ---
>>>>> gcc/config/gcn/gcn.cc | 2 ++
>>>>> gcc/doc/tm.texi | 5 +++++
>>>>> gcc/doc/tm.texi.in | 2 ++
>>>>> gcc/target.def | 8 ++++++++
>>>>> gcc/tree-vect-stmts.cc | 2 +-
>>>>> 5 files changed, 18 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
>>>>> index 3b26d5c6a58..d451bf43355 100644
>>>>> --- a/gcc/config/gcn/gcn.cc
>>>>> +++ b/gcc/config/gcn/gcn.cc
>>>>> @@ -7998,6 +7998,8 @@ gcn_dwarf_register_span (rtx rtl)
>>>>> gcn_vector_alignment_reachable
>>>>> #undef TARGET_VECTOR_MODE_SUPPORTED_P
>>>>> #define TARGET_VECTOR_MODE_SUPPORTED_P gcn_vector_mode_supported_p
>>>>> +#undef TARGET_VECTORIZE_PREFER_GATHER_SCATTER
>>>>> +#define TARGET_VECTORIZE_PREFER_GATHER_SCATTER true
>>>>>
>>>>> #undef TARGET_DOCUMENTATION_NAME
>>>>> #define TARGET_DOCUMENTATION_NAME "AMD GCN"
>>>>> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
>>>>> index 5e305643b3a..29177d81466 100644
>>>>> --- a/gcc/doc/tm.texi
>>>>> +++ b/gcc/doc/tm.texi
>>>>> @@ -6511,6 +6511,11 @@ The default is @code{NULL_TREE} which means to not vectorize scatter
>>>>> stores.
>>>>> @end deftypefn
>>>>>
>>>>> +@deftypevr {Target Hook} bool TARGET_VECTORIZE_PREFER_GATHER_SCATTER
>>>>> +This hook is set to TRUE if gather loads or scatter stores are cheaper on
>>>>> +this target than a sequence of elementwise loads or stores.
>>>>> +@end deftypevr
>>>>> +
>>>>> @deftypefn {Target Hook} int TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN (struct cgraph_node *@var{}, struct cgraph_simd_clone *@var{}, @var{tree}, @var{int}, @var{bool})
>>>>> This hook should set @var{vecsize_mangle}, @var{vecsize_int}, @var{vecsize_float}
>>>>> fields in @var{simd_clone} structure pointed by @var{clone_info} argument and also
>>>>> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
>>>>> index eccc4d88493..b03ad4c97c6 100644
>>>>> --- a/gcc/doc/tm.texi.in
>>>>> +++ b/gcc/doc/tm.texi.in
>>>>> @@ -4311,6 +4311,8 @@ address; but often a machine-dependent strategy can generate better code.
>>>>>
>>>>> @hook TARGET_VECTORIZE_BUILTIN_SCATTER
>>>>>
>>>>> +@hook TARGET_VECTORIZE_PREFER_GATHER_SCATTER
>>>>> +
>>>>> @hook TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN
>>>>>
>>>>> @hook TARGET_SIMD_CLONE_ADJUST
>>>>> diff --git a/gcc/target.def b/gcc/target.def
>>>>> index 38903eb567a..dd57b7072af 100644
>>>>> --- a/gcc/target.def
>>>>> +++ b/gcc/target.def
>>>>> @@ -2056,6 +2056,14 @@ all zeros. GCC can then try to branch around the instruction instead.",
>>>>> (unsigned ifn),
>>>>> default_empty_mask_is_expensive)
>>>>>
>>>>> +/* Prefer gather/scatter loads/stores to e.g. elementwise accesses if\n\
>>>>> +we cannot use a contiguous access. */
>>>>> +DEFHOOKPOD
>>>>> +(prefer_gather_scatter,
>>>>> + "This hook is set to TRUE if gather loads or scatter stores are cheaper
>>>>> on\n\
>>>>> +this target than a sequence of elementwise loads or stores.",
>>>>> + bool, false)
>>>>> +
>>>>> /* Target builtin that implements vector gather operation. */
>>>>> DEFHOOK
>>>>> (builtin_gather,
>>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>>>>> index 2e9b3d2e686..8ca33f5951a 100644
>>>>> --- a/gcc/tree-vect-stmts.cc
>>>>> +++ b/gcc/tree-vect-stmts.cc
>>>>> @@ -2349,7 +2349,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>>>>> allows us to use contiguous accesses. */
>>>>> if ((*memory_access_type == VMAT_ELEMENTWISE
>>>>> || *memory_access_type == VMAT_STRIDED_SLP)
>>>>> - && single_element_p
>>>>> + && (targetm.vectorize.prefer_gather_scatter || single_element_p)
>>>>> && SLP_TREE_LANES (slp_node) == 1
>>>>> && loop_vinfo
>>>>> && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
>>>>> --
>>>>> 2.50.0
>>>>>
>>>
>