On 23/07/2025 14:52, Richard Biener wrote:
On Wed, Jul 23, 2025 at 3:24 PM Andrew Stubbs <a...@baylibre.com> wrote:
On 23/07/2025 13:24, Richard Biener wrote:
On Wed, Jul 23, 2025 at 1:51 PM Andrew Stubbs <a...@baylibre.com> wrote:
From: Julian Brown <jul...@codesourcery.com>
This patch was originally written by Julian in 2021 for the OG10 branch,
but does not appear to have been proposed for upstream at that time, or
since. I've now forward-ported it and retested it. Thomas reported
test regressions with this patch on the OG14 branch, but I think it was
exposing some bugs in the backend; I can't reproduce those failures on
mainline.
I'm not sure what the original motivating test case was, but I see that
the gfortran.dg/vect/fast-math-pr37021.f90 testcase is reduced from ~24k
lines of assembler down to <7k, on amdgcn.
OK for mainline?
I do wonder if the single_element_p check isn't for correctness? And how
the patch makes a difference when we still require SLP_TREE_LANES
(slp_node) == 1?
The SLP_TREE_LANES thing was already true in the case where
single_element_p was set, so I had assumed that was acceptable. I don't
know enough about this stuff to completely understand what Julian has
done here (and I don't suppose there's much point in asking him now,
after all this time).
Without this patch, the testcase has this in the "vect" dump:
Which testcase? It looks like we have a two-element group and
a variable step, and the elements are used even/odd. But we don't
lower that yet (yeah, incomplete work...) and thus have a load permutation
still?
Sorry, it was in my original message, but I should have repeated it.
This is gfortran.dg/vect/fast-math-pr37021.f90. It's using vectors of
complex numbers, hence the even/odd.
_1417 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1415];
ivtmp_1418 = ivtmp_1415 + _1147;
_1419 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1418];
ivtmp_1420 = ivtmp_1418 + _1147;
_1421 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1420];
ivtmp_1422 = ivtmp_1420 + _1147;
_1423 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1422];
ivtmp_1424 = ivtmp_1422 + _1147;
_1425 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1424];
ivtmp_1426 = ivtmp_1424 + _1147;
_1427 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1426];
...
[repeats 32 times]
...
vect_cst__1481 = {_1417, _1419, _1421, _1423, _1425, _1427, ......
_1482 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1480];
ivtmp_1483 = ivtmp_1480 + _1147;
_1484 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1483];
ivtmp_1485 = ivtmp_1483 + _1147;
_1486 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1485];
ivtmp_1487 = ivtmp_1485 + _1147;
_1488 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)ivtmp_1487];
ivtmp_1489 = ivtmp_1487 + _1147;
...
[repeats 32 times]
...
vect_cst__1545 = {_1482, _1484, _1486, _1488, _1490, _1492, ......
vect__42.84_1546 = VEC_PERM_EXPR <vect_cst__1481, vect_cst__1545,
{ 2, 4, 6, 8, 10, 12, 14, 16, .....
And with the patch:
_472 = stride.3_51 * 16;
_473 = (sizetype) _472;
_474 = VEC_SERIES_EXPR <0, _473>;
vect__42.66_504 = .MASK_GATHER_LOAD (vectp.64_501, _474, 1, { 0.0, ...
The pattern occurs a few times in this testcase. (Although there remain
a number of cases where the vectorizer falls back to elementwise
BIT_FIELD_REF accesses that I would prefer were done in some vectorized way.)
So in the case where both elements of the group are used, and thus there's
only one vector and no permute, you'd still get the vector(2) loads and the CTOR.
I suppose you'd still want to use a gather here, but then we either need
to use an element type as large as vector(2) real(kind=8), so that we can
use the same simple VEC_SERIES_EXPR, or generate
VEC_SERIES_EXPR <0, _473> and VEC_SERIES_EXPR <1, _473>
and interleave those into an offset vector that goes
{ 0, 1, _473, _473 + 1, 2*_473, 2*_473 + 1, ... }
for the case of SLP_TREE_LANES != 1 (or rather group_size != 1).
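Concretely, something like this (a hypothetical sketch in the style of the
dumps above, with invented SSA names; the selector interleaves the low
halves of two 64-lane offset vectors, and the offsets would still need
scaling to byte units):

  _offs0 = VEC_SERIES_EXPR <0, _473>;
  _offs1 = VEC_SERIES_EXPR <1, _473>;
  _offs = VEC_PERM_EXPR <_offs0, _offs1, { 0, 64, 1, 65, 2, 66, ... }>;
  vect__x = .MASK_GATHER_LOAD (vectp, _offs, 1, { 0.0, ... }, ...);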
If you can generate a vector of either addresses or offsets without
going elementwise, then we want to use a gather. Even building the offsets
elementwise is better than eating 64 separate sequential memory
latencies when you can parallelize that into a single instruction.
That said, the hook is a bit black/white: whether the target prefers
a gather/scatter over N piecewise operations with equal stride depends
at least on the vector mode. On x86_64 there's definitely no gather for
V2DImode; for V16SFmode it probably depends (a V64QImode gather isn't
supported).
I expected this one might come up. What features would you like to see
in a hook? Should it take the mode, or the vectype? I see there's also
masked_p, which might be relevant to some architectures.
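For instance, is something along these lines the right shape? (Just a
sketch: the parameter list is a guess, and default_prefer_gather_scatter
is hypothetical and would need adding to targhooks. If the vectype is more
appropriate, the signature could take a const_tree instead.)

DEFHOOK
(prefer_gather_scatter,
 "This hook returns TRUE if gather loads or scatter stores are cheaper on\n\
this target than a sequence of elementwise loads or stores, given the\n\
vector mode and whether the access is masked.",
 bool,
 (machine_mode mode, bool masked_p),
 default_prefer_gather_scatter)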
If the patch (and concept) is otherwise OK, I can get that done.
Andrew
Richard.
Richard.
Andrew
------------
For AMD GCN, the instructions available for loading/storing vectors are
always scatter/gather operations (i.e. there are separate addresses for
each vector lane), so the current heuristic to avoid gather/scatter
operations with too many elements in get_group_load_store_type is
counterproductive. Avoiding such operations in that function can
subsequently lead to a missed vectorization opportunity, whereby later
analyses in the vectorizer try to use a very wide array type that is
not available on this target, causing the vectorizer to bail out.
This patch adds a target hook to override the "single_element_p"
heuristic in that function, and enables it for GCN. This allows much
better code to be generated for affected loops.
Co-authored-by: Julian Brown <jul...@codesourcery.com>
gcc/
* doc/tm.texi.in (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Document
new hook.
* doc/tm.texi: Regenerate.
* target.def (prefer_gather_scatter): New target hook.
* tree-vect-stmts.cc (get_group_load_store_type): Optionally prefer
gather/scatter instructions to scalar/elementwise fallback.
* config/gcn/gcn.cc (TARGET_VECTORIZE_PREFER_GATHER_SCATTER): Define
hook.
---
gcc/config/gcn/gcn.cc | 2 ++
gcc/doc/tm.texi | 5 +++++
gcc/doc/tm.texi.in | 2 ++
gcc/target.def | 8 ++++++++
gcc/tree-vect-stmts.cc | 2 +-
5 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 3b26d5c6a58..d451bf43355 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -7998,6 +7998,8 @@ gcn_dwarf_register_span (rtx rtl)
gcn_vector_alignment_reachable
#undef TARGET_VECTOR_MODE_SUPPORTED_P
#define TARGET_VECTOR_MODE_SUPPORTED_P gcn_vector_mode_supported_p
+#undef TARGET_VECTORIZE_PREFER_GATHER_SCATTER
+#define TARGET_VECTORIZE_PREFER_GATHER_SCATTER true
#undef TARGET_DOCUMENTATION_NAME
#define TARGET_DOCUMENTATION_NAME "AMD GCN"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 5e305643b3a..29177d81466 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6511,6 +6511,11 @@ The default is @code{NULL_TREE} which means to not vectorize scatter
stores.
@end deftypefn
+@deftypevr {Target Hook} bool TARGET_VECTORIZE_PREFER_GATHER_SCATTER
+This hook is set to TRUE if gather loads or scatter stores are cheaper on
+this target than a sequence of elementwise loads or stores.
+@end deftypevr
+
@deftypefn {Target Hook} int TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN (struct cgraph_node *@var{}, struct cgraph_simd_clone *@var{}, @var{tree}, @var{int}, @var{bool})
This hook should set @var{vecsize_mangle}, @var{vecsize_int}, @var{vecsize_float}
fields in @var{simd_clone} structure pointed by @var{clone_info} argument and also
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index eccc4d88493..b03ad4c97c6 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4311,6 +4311,8 @@ address; but often a machine-dependent strategy can generate better code.
@hook TARGET_VECTORIZE_BUILTIN_SCATTER
+@hook TARGET_VECTORIZE_PREFER_GATHER_SCATTER
+
@hook TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN
@hook TARGET_SIMD_CLONE_ADJUST
diff --git a/gcc/target.def b/gcc/target.def
index 38903eb567a..dd57b7072af 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2056,6 +2056,14 @@ all zeros. GCC can then try to branch around the instruction
instead.",
(unsigned ifn),
default_empty_mask_is_expensive)
+/* Prefer gather/scatter loads/stores to e.g. elementwise accesses if
+   we cannot use a contiguous access.  */
+DEFHOOKPOD
+(prefer_gather_scatter,
+ "This hook is set to TRUE if gather loads or scatter stores are cheaper on\n\
+this target than a sequence of elementwise loads or stores.",
+ bool, false)
+
/* Target builtin that implements vector gather operation. */
DEFHOOK
(builtin_gather,
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 2e9b3d2e686..8ca33f5951a 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2349,7 +2349,7 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
allows us to use contiguous accesses. */
if ((*memory_access_type == VMAT_ELEMENTWISE
|| *memory_access_type == VMAT_STRIDED_SLP)
- && single_element_p
+ && (targetm.vectorize.prefer_gather_scatter || single_element_p)
&& SLP_TREE_LANES (slp_node) == 1
&& loop_vinfo
&& vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
--
2.50.0