Tamar Christina <tamar.christ...@arm.com> writes:
>> >> i.e. we use separate address arithmetic and avoid UMOVs.  Counting
>> >> two loads and one store for each element of the scatter store seems
>> >> like overkill for that.
>> >
>> > Hmm agreed..
>> >
>> > How about for stores we increase the load counts by count / 2?
>> >
>> > This would account for the fact that we know we have indexed stores
>> > and so the data-to-scalar operation is free?
>>
>> Yeah, sounds good.  We should probably divide count itself by 2,
>> then apply the new count to both the load heuristic and the general ops,
>> to avoid double-counting in both.  (The V pipe usage for stores is
>> modelled as part of the scalar_store itself.)  But like you say,
>> we should probably drop the - 1 from the load adjustment for stores,
>> because that - 1 would also be applied twice.
>>
>
> Here's a new version of the patch.  For now I've punted on the data bound
> variant, as to do that properly we do need to model the transfer
> throughputs.  So deferring that to GCC 16.
OK, so just to check I understand, that's why the patch isn't just:

  if (kind == vec_to_scalar
      && (m_vec_flags & VEC_ADVSIMD)
      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
    {
      if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
        ops->loads += count - 1;
      else
        /* Stores want to count both the index to array and data to
           array using vec_to_scalar.  However we have index stores in
           Adv.SIMD and so we only want to adjust the index loads.  */
        ops->loads += count / 2;
    }

(with no early return).  Is that right?  If so, that's ok with me, but...

> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>       PR target/118188
>       * config/aarch64/aarch64.cc (aarch64_vector_costs::count_ops): Adjust
>       throughput of emulated gather and scatters.
>
> gcc/testsuite/ChangeLog:
>
>       PR target/118188
>       * gcc.target/aarch64/sve/gather_load_12.c: New test.
>       * gcc.target/aarch64/sve/gather_load_13.c: New test.
>       * gcc.target/aarch64/sve/gather_load_14.c: New test.
>
> -- inline copy of patch --
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index d6a8e4c209527f53e6292887b3fbc8a0ac3245da..efe9a935192748fba4e71e98a55693317f5ca786 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17348,6 +17348,47 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>        return;
>      }
>
> +  /* Detect the case where we are using an emulated gather/scatter.  When a
> +     target does not support gathers and scatters directly the vectorizer
> +     emulates these by constructing an index vector and then issuing an
> +     extraction for every lane in the vector.  This is subsequently lowered
> +     by veclower into a series of loads which creates the scalar indexes for
> +     the subsequent loads.

I think the last sentence should be "If the index vector is loaded from
memory, the vector load and extractions are subsequently lowered by
veclower into a series of scalar index loads".  (The key part being the
"if".)

> +     After the final loads are done it issues a
> +     vec_construct to recreate the vector from the scalar.  For costing when
> +     we see a vec_to_scalar on a stmt with VMAT_GATHER_SCATTER we are dealing
> +     with an emulated instruction and should adjust costing properly.  */
> +  if (kind == vec_to_scalar
> +      && (m_vec_flags & VEC_ADVSIMD)
> +      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
> +    {
> +      auto dr = STMT_VINFO_DATA_REF (stmt_info);
> +      tree dr_ref = DR_REF (dr);
> +      /* Only really expect MEM_REF or ARRAY_REF here.  Ignore the rest.  */
> +      switch (TREE_CODE (dr_ref))
> +        {
> +        case MEM_REF:
> +        case ARRAY_REF:
> +        case TARGET_MEM_REF:
> +          {
> +            tree offset = TREE_OPERAND (dr_ref, 1);
> +            if (SSA_VAR_P (offset)
> +                && gimple_vuse (SSA_NAME_DEF_STMT (offset)))
> +              {
> +                if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
> +                  ops->loads += count - 1;
> +                else
> +                  /* Stores want to count both the index to array and data to
> +                     array using vec_to_scalar.  However we have index stores in
> +                     Adv.SIMD and so we only want to adjust the index loads.  */
> +                  ops->loads += count / 2;
> +                return;
> +              }
> +            break;
> +          }
> +        default:
> +          break;
> +        }
> +    }

Operand 1 of MEM_REF and TARGET_MEM_REF are always constant, so the
handling of those codes looks redundant.  Perhaps we should instead use:

  while (handled_component_p (dr_ref))
    {
      if (TREE_CODE (dr_ref) == ARRAY_REF)
        {
          ...early return or break if SSA offset found...
        }
      dr_ref = TREE_OPERAND (dr_ref, 0);
    }

A gather load could reasonably be to COMPONENT_REFs or BIT_FIELD_REFs of
an ARRAY_REF, so the ARRAY_REF might not be the outermost code.

Thanks,
Richard

>    /* Count the basic operation cost associated with KIND.  */
>    switch (kind)
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..d550f005d638f260973312fbccd05e2bb9aef1a7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
> +
> +#define iterations 100000
> +#define LEN_1D 32000
> +
> +float a[LEN_1D], b[LEN_1D];
> +
> +float
> +s4115 (int *ip)
> +{
> +  float sum = 0.;
> +  for (int i = 0; i < LEN_1D; i++)
> +    {
> +      sum += a[i] * b[ip[i]];
> +    }
> +  return sum;
> +}
> +
> +/* { dg-final { scan-assembler {\s+ld1w\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_13.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_13.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..24da0646a75e41eae0c0dd6a2b8af58631ac94cb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_13.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
> +
> +#define iterations 100000
> +#define LEN_1D 32000
> +
> +float a[LEN_1D], b[LEN_1D];
> +
> +float
> +s4115 (int *ip)
> +{
> +  float sum = 0.;
> +  for (int i = 0; i < LEN_1D; i++)
> +    {
> +      sum += a[i] * b[ip[i] + 1];
> +    }
> +  return sum;
> +}
> +
> +/* { dg-final { scan-assembler {\s+ld1w\t} { xfail *-*-* } } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_14.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_14.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..77d06d2a0e0ee89e5e45730eebf9b4bebfa2aaed
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_14.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
> +
> +#define iterations 100000
> +#define LEN_1D 32000
> +
> +float a[LEN_1D], b[LEN_1D];
> +
> +float
> +s4115 (int *ip)
> +{
> +  float sum = 0.;
> +  for (int i = 0; i < LEN_1D; i++)
> +    {
> +      sum += a[i] * b[ip[i]];
> +    }
> +  return sum;
> +}
> +
> +/* { dg-final { scan-assembler-not {\s+st1w\t} } } */
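
For anyone following the thread, here is a rough, illustrative sketch of how
the suggested handled_component_p walk might slot into count_ops together
with the count / 2 store adjustment discussed above.  This is not the
committed patch: in particular, the TREE_CODE (index) == SSA_NAME test is
just one possible reading of "SSA offset found" (the posted patch used
SSA_VAR_P), and there is deliberately no early return, so the generic
costing below the block would still run.

  if (kind == vec_to_scalar
      && (m_vec_flags & VEC_ADVSIMD)
      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
    {
      /* Walk the data reference from the outside in, looking for an
         ARRAY_REF whose index is an SSA name defined by a statement that
         reads memory, i.e. an index that itself comes from a load.  */
      tree dr_ref = DR_REF (STMT_VINFO_DATA_REF (stmt_info));
      while (handled_component_p (dr_ref))
        {
          if (TREE_CODE (dr_ref) == ARRAY_REF)
            {
              tree index = TREE_OPERAND (dr_ref, 1);
              if (TREE_CODE (index) == SSA_NAME
                  && gimple_vuse (SSA_NAME_DEF_STMT (index)))
                {
                  if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
                    /* Emulated gather: the lane extractions end up as
                       scalar index loads.  */
                    ops->loads += count - 1;
                  else
                    /* Emulated scatter: count covers both index and data
                       extractions, but Adv. SIMD has indexed stores, so
                       only the index half needs extra loads.  */
                    ops->loads += count / 2;
                }
              break;
            }
          dr_ref = TREE_OPERAND (dr_ref, 0);
        }
    }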