> >> i.e. we use separate address arithmetic and avoid UMOVs.  Counting
> >> two loads and one store for each element of the scatter store seems
> >> like overkill for that.
> >
> > Hmm agreed..
> >
> > How about for stores we increase the load counts by count / 2?
> >
> > This would account for the fact that we know we have indexed stores
> > and so the data-to-scalar operation is free?
>
> Yeah, sounds good.  We should probably divide count itself by 2,
> then apply the new count to both the load heuristic and the general ops,
> to avoid double-counting in both.  (The V pipe usage for stores is
> modelled as part of the scalar_store itself.)  But like you say,
> we should probably drop the - 1 from the load adjustment for stores,
> because that - 1 would also be applied twice.
>
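To make the arithmetic above concrete, here is a small standalone sketch (my own illustration, not GCC code and not part of the patch) of the adjustment being discussed.  It assumes COUNT is the number of vec_to_scalar ops recorded for the access, and that for stores COUNT covers both the index and the data extractions:

  /* Standalone sketch, illustrative only.  */
  #include <assert.h>

  static unsigned
  extra_index_loads (unsigned count, int is_load)
  {
    if (is_load)
      /* Gather load: one scalar index load per lane, mirroring
	 ops->loads += count - 1 in the patch below.  */
      return count - 1;
    /* Scatter store: indexed stores make the data-to-scalar transfer free,
       so only the index half of COUNT is charged, mirroring
       ops->loads += count / 2 (with no - 1, as discussed above).  */
    return count / 2;
  }

  int
  main (void)
  {
    assert (extra_index_loads (4, 1) == 3);	/* 4-lane gather load.  */
    assert (extra_index_loads (8, 0) == 4);	/* 4-lane scatter store.  */
    return 0;
  }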
Here's a new version of the patch.  For now I've punted on the data-bound
variant, since doing that properly requires modelling the transfer
throughputs, so I'm deferring that to GCC 16.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/118188
	* config/aarch64/aarch64.cc (aarch64_vector_costs::count_ops):
	Adjust throughput of emulated gathers and scatters.

gcc/testsuite/ChangeLog:

	PR target/118188
	* gcc.target/aarch64/sve/gather_load_12.c: New test.
	* gcc.target/aarch64/sve/gather_load_13.c: New test.
	* gcc.target/aarch64/sve/gather_load_14.c: New test.

-- inline copy of patch --

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d6a8e4c209527f53e6292887b3fbc8a0ac3245da..efe9a935192748fba4e71e98a55693317f5ca786 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -17348,6 +17348,47 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
       return;
     }
 
+  /* Detect the case where we are using an emulated gather/scatter.  When a
+     target does not support gathers and scatters directly the vectorizer
+     emulates these by constructing an index vector and then issuing an
+     extraction for every lane in the vector.  This is subsequently lowered
+     by veclower into a series of loads which create the scalar indexes for
+     the subsequent loads.  After the final loads are done it issues a
+     vec_construct to recreate the vector from the scalars.  For costing,
+     when we see a vec_to_scalar on a stmt with VMAT_GATHER_SCATTER we are
+     dealing with an emulated instruction and should adjust costing
+     properly.  */
+  if (kind == vec_to_scalar
+      && (m_vec_flags & VEC_ADVSIMD)
+      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
+    {
+      auto dr = STMT_VINFO_DATA_REF (stmt_info);
+      tree dr_ref = DR_REF (dr);
+      /* Only really expect MEM_REF or ARRAY_REF here.  Ignore the rest.  */
+      switch (TREE_CODE (dr_ref))
+	{
+	case MEM_REF:
+	case ARRAY_REF:
+	case TARGET_MEM_REF:
+	  {
+	    tree offset = TREE_OPERAND (dr_ref, 1);
+	    if (SSA_VAR_P (offset)
+		&& gimple_vuse (SSA_NAME_DEF_STMT (offset)))
+	      {
+		if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
+		  ops->loads += count - 1;
+		else
+		  /* Stores want to count both the index to array and data to
+		     array using vec_to_scalar.  However we have indexed
+		     stores in Adv. SIMD and so we only want to adjust the
+		     index loads.  */
+		  ops->loads += count / 2;
+		return;
+	      }
+	    break;
+	  }
+	default:
+	  break;
+	}
+    }
+
   /* Count the basic operation cost associated with KIND.  */
   switch (kind)
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c
new file mode 100644
index 0000000000000000000000000000000000000000..d550f005d638f260973312fbccd05e2bb9aef1a7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
+
+#define iterations 100000
+#define LEN_1D 32000
+
+float a[LEN_1D], b[LEN_1D];
+
+float
+s4115 (int *ip)
+{
+  float sum = 0.;
+  for (int i = 0; i < LEN_1D; i++)
+    {
+      sum += a[i] * b[ip[i]];
+    }
+  return sum;
+}
+
+/* { dg-final { scan-assembler {\s+ld1w\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_13.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_13.c
new file mode 100644
index 0000000000000000000000000000000000000000..24da0646a75e41eae0c0dd6a2b8af58631ac94cb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_13.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
+
+#define iterations 100000
+#define LEN_1D 32000
+
+float a[LEN_1D], b[LEN_1D];
+
+float
+s4115 (int *ip)
+{
+  float sum = 0.;
+  for (int i = 0; i < LEN_1D; i++)
+    {
+      sum += a[i] * b[ip[i] + 1];
+    }
+  return sum;
+}
+
+/* { dg-final { scan-assembler {\s+ld1w\t} { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_14.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_14.c
new file mode 100644
index 0000000000000000000000000000000000000000..77d06d2a0e0ee89e5e45730eebf9b4bebfa2aaed
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_14.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
+
+#define iterations 100000
+#define LEN_1D 32000
+
+float a[LEN_1D], b[LEN_1D];
+
+float
+s4115 (int *ip)
+{
+  float sum = 0.;
+  for (int i = 0; i < LEN_1D; i++)
+    {
+      sum += a[i] * b[ip[i]];
+    }
+  return sum;
+}
+
+/* { dg-final { scan-assembler-not {\s+st1w\t} } } */
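For completeness, a loop of the following shape (illustrative only, not one of the new tests, and the function name is made up) is the store-side case the count / 2 adjustment targets; the gather_load_1[2-4].c tests above are built around gather loads:

  /* Illustrative scatter-store loop: dst[ip[i]] = ... is the store that
     has to be emulated on targets without native scatter stores.  */
  void
  scatter (int *restrict ip, float *restrict dst, float *restrict src, int n)
  {
    for (int i = 0; i < n; i++)
      dst[ip[i]] = src[i];
  }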