> On 23.07.2025, at 16:09, Andrew Stubbs <a...@baylibre.com> wrote:
> 
> On 23/07/2025 14:52, Richard Biener wrote:
>>> On Wed, Jul 23, 2025 at 3:24 PM Andrew Stubbs <a...@baylibre.com> wrote:
>>> 
>>> On 23/07/2025 13:24, Richard Biener wrote:
>>>> On Wed, Jul 23, 2025 at 1:51 PM Andrew Stubbs <a...@baylibre.com> wrote:
>>>>> 
>>>>> From: Julian Brown <jul...@codesourcery.com>
>>>>> 
>>>>> This patch was originally written by Julian in 2021 for the OG10 branch,
>>>>> but does not appear to have been proposed for upstream at that time, or
>>>>> since.  I've now forward ported it and retested it.  Thomas reported
>>>>> test regressions with this patch on the OG14 branch, but I think it was
>>>>> exposing some bugs in the backend; I can't reproduce those failures on
>>>>> mainline.
>>>>> 
>>>>> I'm not sure what the original motivating test case was, but I see that
>>>>> the gfortran.dg/vect/fast-math-pr37021.f90 testcase is reduced from ~24k
>>>>> lines of assembler down to <7k, on amdgcn.
>>>>> 
>>>>> OK for mainline?
>>>> 
>>>> I do wonder if the single_element_p check isn't for correctness?  And how
>>>> the patch makes a difference when we still require SLP_TREE_LANES
>>>> (slp_node) == 1?
>>> 
>>> The SLP_TREE_LANES thing was already true in the case where
>>> single_element_p was set, so I had assumed that was acceptable.  I don't
>>> know enough about this stuff to completely understand what Julian has
>>> done here (and I don't suppose there's much point in asking him now,
>>> after all this time).
>>> 
>>> Without this patch, the testcase has this in the "vect" dump:
>> Which testcase?  It looks like we have a two element group and
>> a variable step, and the elements are used even/odd.   But we don't
>> lower that yet (yeah, incomplete work...) and thus have a load permutation
>> still?
> 
> Sorry, it was in my original message, but I should have repeated it.
> 
> This is gfortran.dg/vect/fast-math-pr37021.f90. It's using vectors of complex 
> numbers, hence the even/odd.

So you end up with two gathers, one for the even and one for the odd elements?
Would it be better if they were combined and then split even/odd after the load?
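
I.e. one gather that pulls in both components of each element and then two
permutes, roughly along these lines (hand-written sketch only; the names and
the exact .MASK_GATHER_LOAD operands are made up, and _offs would have to
cover both components of each element):

   vect_ri   = .MASK_GATHER_LOAD (vectp_base, _offs, 1, { 0.0, ... }, ...);
   vect_even = VEC_PERM_EXPR <vect_ri, vect_ri, { 0, 2, 4, ... }>;
   vect_odd  = VEC_PERM_EXPR <vect_ri, vect_ri, { 1, 3, 5, ... }>;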

>>>    _1417 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1415];
>>>    ivtmp_1418 = ivtmp_1415 + _1147;
>>>    _1419 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1418];
>>>    ivtmp_1420 = ivtmp_1418 + _1147;
>>>    _1421 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1420];
>>>    ivtmp_1422 = ivtmp_1420 + _1147;
>>>    _1423 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1422];
>>>    ivtmp_1424 = ivtmp_1422 + _1147;
>>>    _1425 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1424];
>>>    ivtmp_1426 = ivtmp_1424 + _1147;
>>>    _1427 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1426];
>>>     :
>>>     :
>>>    [repeats 32 times]
>>>     :
>>>     :
>>>    vect_cst__1481 = {_1417, _1419, _1421, _1423, _1425, _1427, ......
>>>    _1482 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1480];
>>>    ivtmp_1483 = ivtmp_1480 + _1147;
>>>    _1484 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1483];
>>>    ivtmp_1485 = ivtmp_1483 + _1147;
>>>    _1486 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1485];
>>>    ivtmp_1487 = ivtmp_1485 + _1147;
>>>    _1488 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1487];
>>>    ivtmp_1489 = ivtmp_1487 + _1147;
>>>     :
>>>     :
>>>    [repeats 32 times]
>>>     :
>>>     :
>>>    vect_cst__1545 = {_1482, _1484, _1486, _1488, _1490, _1492, ......
>>>    vect__42.84_1546 = VEC_PERM_EXPR <vect_cst__1481, vect_cst__1545,
>>>               { 2, 4, 6, 8, 10, 12, 14, 16, .....
>>> 
>>> And with the patch:
>>> 
>>>    _472 = stride.3_51 * 16;
>>>    _473 = (sizetype) _472;
>>>    _474 = VEC_SERIES_EXPR <0, _473>;
>>>    vect__42.66_504 = .MASK_GATHER_LOAD (vectp.64_501, _474, 1, { 0.0, ...
>>> 
>>> The pattern occurs a few times in this testcase.  (Although there remain
>>> a number of cases where the vectorizer is falling back to elementwise
>>> BIT_FIELD_REF that I would prefer were done in some vectorized way.)
>> So in case you had both elements from the group used and thus only one
>> vector and no permute you'd still get the vector(2) loads and the CTOR.
>> I suppose you still want to use gather here, but then we either need to
>> use an element type as large as vector(2) real(kind=8) to be able to use
>> the same simple VEC_SERIES_EXPR or generate
>> VEC_SERIES_EXPR <0, _473> and VEC_SERIES_EXPR <1, _473>
>> and interleave those into an offset vector that goes { 0, 1, _473,
>> _473 + 1, 2*_473, 2*_473 + 1, ... }
>> for the case of SLP_TREE_LANES != 1 (or rather group_size != 1).
> 
> If you can generate a vector of either addresses or offsets without going 
> elementwise, then we want to use a gather.  Even building the offsets 
> elementwise is better than eating 64 separate sequential memory latencies 
> when you can parallelize that into a single instruction.
> 
>> That said, the hook is a bit black/white - whether the target prefers
>> a gather/scatter over N piecewise operations with equal stride depends
>> at least on the vector mode.  On x86_64 for V2DImode definitely no
>> gather, for V16SFmode it probably depends (V64QImode gather isn't
>> supported).
> 
> I expected this one might come up.  What features would you like to see in a 
> hook?  Should it be the mode, or the vectype?  I also see masked_p, which 
> might be relevant on some architectures?

The mode, the (possibly constant) stride, and the group size.
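
Something along these lines, perhaps, i.e. turning the DEFHOOKPOD into a real
DEFHOOK (only a sketch; the parameter names/types and the
default_prefer_gather_scatter default are illustrative, not decided):

DEFHOOK
(prefer_gather_scatter,
 "Return true if the target prefers a gather load or scatter store over a\n\
sequence of elementwise accesses, for vectors of mode @var{mode} with a\n\
group of @var{group_size} accesses whose (possibly non-constant) step is\n\
@var{step}.",
 bool, (machine_mode mode, tree step, unsigned int group_size),
 default_prefer_gather_scatter)

The use in get_group_load_store_type would then pass the vector mode, the DR
step and the group size instead of testing a plain POD flag.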

> If the patch (and concept) is otherwise OK, I can get that done.
> 
> Andrew
> 
>> Richard.
>>> 
>>>> Richard.
>>>> 
>>>>> Andrew
>>>>> 
>>>>> ------------
>>>>> 
>>>>> For AMD GCN, the instructions available for loading/storing vectors are
>>>>> always scatter/gather operations (i.e. there are separate addresses for
>>>>> each vector lane), so the current heuristic to avoid gather/scatter
>>>>> operations with too many elements in get_group_load_store_type is
>>>>> counterproductive. Avoiding such operations in that function can
>>>>> subsequently lead to a missed vectorization opportunity whereby later
>>>>> analyses in the vectorizer try to use a very wide array type which is
>>>>> not available on this target, and thus it bails out.
>>>>> 
>>>>> This patch adds a target hook to override the "single_element_p"
>>>>> heuristic in that function, and activates it for GCN.  This allows
>>>>> much better code to be generated for affected loops.
>>>>> 
>>>>> Co-authored-by:  Julian Brown  <jul...@codesourcery.com>
>>>>> 
>>>>> gcc/
>>>>>          * doc/tm.texi.in (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Add
>>>>>          documentation hook.
>>>>>          * doc/tm.texi: Regenerate.
>>>>>          * target.def (prefer_gather_scatter): Add target hook under vectorizer.
>>>>>          * tree-vect-stmts.cc (get_group_load_store_type): Optionally prefer
>>>>>          gather/scatter instructions to scalar/elementwise fallback.
>>>>>          * config/gcn/gcn.cc (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Define
>>>>>          hook.
>>>>> ---
>>>>>   gcc/config/gcn/gcn.cc  | 2 ++
>>>>>   gcc/doc/tm.texi        | 5 +++++
>>>>>   gcc/doc/tm.texi.in     | 2 ++
>>>>>   gcc/target.def         | 8 ++++++++
>>>>>   gcc/tree-vect-stmts.cc | 2 +-
>>>>>   5 files changed, 18 insertions(+), 1 deletion(-)
>>>>> 
>>>>> diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
>>>>> index 3b26d5c6a58..d451bf43355 100644
>>>>> --- a/gcc/config/gcn/gcn.cc
>>>>> +++ b/gcc/config/gcn/gcn.cc
>>>>> @@ -7998,6 +7998,8 @@ gcn_dwarf_register_span (rtx rtl)
>>>>>     gcn_vector_alignment_reachable
>>>>>   #undef  TARGET_VECTOR_MODE_SUPPORTED_P
>>>>>   #define TARGET_VECTOR_MODE_SUPPORTED_P gcn_vector_mode_supported_p
>>>>> +#undef  TARGET_VECTORIZE_PREFER_GATHER_SCATTER
>>>>> +#define TARGET_VECTORIZE_PREFER_GATHER_SCATTER true
>>>>> 
>>>>>   #undef TARGET_DOCUMENTATION_NAME
>>>>>   #define TARGET_DOCUMENTATION_NAME "AMD GCN"
>>>>> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
>>>>> index 5e305643b3a..29177d81466 100644
>>>>> --- a/gcc/doc/tm.texi
>>>>> +++ b/gcc/doc/tm.texi
>>>>> @@ -6511,6 +6511,11 @@ The default is @code{NULL_TREE} which means to not vectorize scatter
>>>>>   stores.
>>>>>   @end deftypefn
>>>>> 
>>>>> +@deftypevr {Target Hook} bool TARGET_VECTORIZE_PREFER_GATHER_SCATTER
>>>>> +This hook is set to TRUE if gather loads or scatter stores are cheaper on
>>>>> +this target than a sequence of elementwise loads or stores.
>>>>> +@end deftypevr
>>>>> +
>>>>>   @deftypefn {Target Hook} int TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN (struct cgraph_node *@var{}, struct cgraph_simd_clone *@var{}, @var{tree}, @var{int}, @var{bool})
>>>>>   This hook should set @var{vecsize_mangle}, @var{vecsize_int}, @var{vecsize_float}
>>>>>   fields in @var{simd_clone} structure pointed by @var{clone_info} argument and also
>>>>> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
>>>>> index eccc4d88493..b03ad4c97c6 100644
>>>>> --- a/gcc/doc/tm.texi.in
>>>>> +++ b/gcc/doc/tm.texi.in
>>>>> @@ -4311,6 +4311,8 @@ address;  but often a machine-dependent strategy can generate better code.
>>>>> 
>>>>>   @hook TARGET_VECTORIZE_BUILTIN_SCATTER
>>>>> 
>>>>> +@hook TARGET_VECTORIZE_PREFER_GATHER_SCATTER
>>>>> +
>>>>>   @hook TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN
>>>>> 
>>>>>   @hook TARGET_SIMD_CLONE_ADJUST
>>>>> diff --git a/gcc/target.def b/gcc/target.def
>>>>> index 38903eb567a..dd57b7072af 100644
>>>>> --- a/gcc/target.def
>>>>> +++ b/gcc/target.def
>>>>> @@ -2056,6 +2056,14 @@ all zeros.  GCC can then try to branch around the instruction instead.",
>>>>>    (unsigned ifn),
>>>>>    default_empty_mask_is_expensive)
>>>>> 
>>>>> +/* Prefer gather/scatter loads/stores to e.g. elementwise accesses if\n\
>>>>> +we cannot use a contiguous access.  */
>>>>> +DEFHOOKPOD
>>>>> +(prefer_gather_scatter,
>>>>> + "This hook is set to TRUE if gather loads or scatter stores are cheaper 
>>>>> on\n\
>>>>> +this target than a sequence of elementwise loads or stores.",
>>>>> + bool, false)
>>>>> +
>>>>>   /* Target builtin that implements vector gather operation.  */
>>>>>   DEFHOOK
>>>>>   (builtin_gather,
>>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>>>>> index 2e9b3d2e686..8ca33f5951a 100644
>>>>> --- a/gcc/tree-vect-stmts.cc
>>>>> +++ b/gcc/tree-vect-stmts.cc
>>>>> @@ -2349,7 +2349,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>>>>>        allows us to use contiguous accesses.  */
>>>>>     if ((*memory_access_type == VMAT_ELEMENTWISE
>>>>>          || *memory_access_type == VMAT_STRIDED_SLP)
>>>>> -      && single_element_p
>>>>> +      && (targetm.vectorize.prefer_gather_scatter || single_element_p)
>>>>>         && SLP_TREE_LANES (slp_node) == 1
>>>>>         && loop_vinfo
>>>>>         && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
>>>>> --
>>>>> 2.50.0
>>>>> 
>>> 
> 
