> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Thursday, January 2, 2025 5:54 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; ktkac...@gcc.gnu.org
> Subject: Re: [PATCH]AArch64: Fix costing of emulated gathers/scatters [PR118188]
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> > [...]
> >> > #define iterations 100000
> >> > #define LEN_1D 32000
> >> >
> >> > float a[LEN_1D], b[LEN_1D];
> >> >
> >> > float
> >> > s4115 (int *ip)
> >> > {
> >> >     float sum = 0.;
> >> >     for (int i = 0; i < LEN_1D; i++)
> >> >       {
> >> >         sum += a[i] * b[ip[i]];
> >> >       }
> >> >     return sum;
> >> > }
> >> > [...]
> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> >> > index 6bb4bdf2472e62d9b066a06561da8e516f1b3c3e..cb9b155826d12b622ae0df1736e4b042d01cf56a 100644
> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> > @@ -17358,6 +17358,25 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
> >> >        return;
> >> >      }
> >> >
> >> > +  /* Detect the case where we are using an emulated gather/scatter.  When a
> >> > +     target does not support gathers and scatters directly the vectorizer
> >> > +     emulates these by constructing an index vector and then issuing an
> >> > +     extraction for every lane in the vector.  This is subsequently lowered
> >> > +     by veclower into a series of loads which creates the scalar indexes for
> >> > +     the subsequent loads.  After the final loads are done it issues a
> >> > +     vec_construct to recreate the vector from the scalar.  For costing when
> >> > +     we see a vec_to_scalar on a stmt with VMAT_GATHER_SCATTER we are dealing
> >> > +     with an emulated instruction and should adjust costing properly.  */
> >> > +  if (kind == vec_to_scalar
> >> > +      && (m_vec_flags & VEC_ADVSIMD)
> >> > +      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
> >> > +    {
> >> > +      if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
> >> > +        ops->loads += count - 1;
> >> > +      else
> >> > +        ops->stores += count - 1;
> >>
> >> ...since the aim is to replace:
> >>
> >>   - 1 vector load for the indices
> >>
> >> with
> >>
> >>   - 4 vector loads for the indices
> >>
> >> I think this should be changing ops->loads even for scatter stores.
> >
> > Ah yes that's true... because the indexes for the stores are loads
> > themselves.
> >
> >> But this assumes that the gather load indices come directly from memory.
> >> That isn't always the case.
> >> If we have:
> >>
> >> float
> >> s4115 (int *ip)
> >> {
> >>     float sum = 0.;
> >>     for (int i = 0; i < LEN_1D; i++)
> >>       {
> >>         sum += a[i] * b[ip[i] + 1];
> >>       }
> >>     return sum;
> >> }
> >>
> >> then we'll have a vector load, a vector add, and then the 4 umovs
> >> that, as you said above, were elided by post-vectoriser optimisations
> >> in your b[ip[i]] example:
> >>
> >> .L2:
> >>         ldr     q31, [x0, x1]
> >>         ldr     q29, [x6, x1]
> >>         add     x1, x1, 16
> >>         add     v31.4s, v31.4s, v26.4s
> >>         umov    w5, v31.s[1]
> >>         umov    w4, v31.s[3]
> >>         umov    w3, v31.s[2]
> >>         fmov    w8, s31
> >>         ldr     s30, [x2, w5, sxtw 2]
> >>         ldr     s27, [x2, w4, sxtw 2]
> >>         ldr     s31, [x2, w8, sxtw 2]
> >>         ldr     s28, [x2, w3, sxtw 2]
> >>         uzp1    v30.2s, v30.2s, v27.2s
> >>         uzp1    v31.2s, v31.2s, v28.2s
> >>         zip1    v31.4s, v31.4s, v30.4s
> >>         fmla    v0.4s, v29.4s, v31.4s
> >>         cmp     x1, x7
> >>         bne     .L2
> >>
> >> These umovs are currently modelled:
> >>
> >>   note: load operations = 6
> >>   note: store operations = 0
> >>   note: general operations = 7
> >>   note: reduction latency = 2
> >>
> >> although I'm guessing it should be:
> >>
> >>   note: load operations = 6
> >>   note: store operations = 0
> >>   note: general operations = 9
> >>   note: reduction latency = 2
> >>
> >> instead.  The underaccounting comes from vec_construct, which counts 1
> >> rather than 3 operations for moving the scalars back to a vector.
> >>
> >> This patch would remove the umov accounting to give:
> >>
> >>   note: load operations = 9
> >>   note: store operations = 0
> >>   note: general operations = 3
> >>   note: reduction latency = 2
> >>
> >> Counting loads rather than general ops wouldn't matter for tight loops
> >> like these in which memory dominates anyway, since counting loads is
> >> then pessimistically correct.  But in a loop that is compute bound,
> >> it's probably more important to get this right.
> >>
> >> So I think ideally, we should try to detect whether the indices come
> >> directly from memory or are the result of arithmetic.  In the former case,
> >> we should do the loads adjustment above.  In the latter case, we should
> >> keep the vec_to_scalar accounting unchanged.
> >
> > I can do this but...
> >
> >> Of course, these umovs are likely to be more throughput-limited than we
> >> model, but that's a separate pre-existing problem...
> >
> > I agree with the above.  The reason I just updated loads is, as you already
> > said, that the umov accounting as general operations doesn't account for the
> > bottleneck.  In general umovs are more throughput limited than loads, and the
> > number of general ops we can execute would in the example above misrepresent
> > the throughput, as it still thinks it can execute all transfers + all scalar
> > loads in one cycle.  As the number of VX increases, modelling them as general
> > ops incorrectly favors the emulated gather.  See e.g. Cortex-X925.
> >
> > By still modelling them as loads it more accurately models that the data
> > loads have to wait for the indexes.
> >
> > The problem with modelling them as general ops is that when compared to the
> > IFN for SVE they end up being cheaper.  For instance the umov case above is
> > faster using an actual SVE gather.
> >
> > So if we really want to be accurate we have to model vec transfers, as
> > otherwise it still models the index transfers as effectively free.
>
> Yeah, agree that we eventually need to model transfers properly.
>
> But I think my point still stands that modelling loads instead of
> general ops won't help in cases where memory doesn't dominate.
> Modelling UMOVs as general ops does give us something in that case,
> even if it's not perfect.
>
> How about, as a compromise, just removing the early return?  That way
> we won't "regress" in the counting of general ops for the case of
> arithmetic indices, but will still get the benefit of the load
> heuristic.
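
If I follow, that just means keeping the loads adjustment but dropping the
return at the end of the new block, so that we still fall through to the
existing general ops accounting.  Roughly (untested, reusing the hunk quoted
above):

  if (kind == vec_to_scalar
      && (m_vec_flags & VEC_ADVSIMD)
      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
    {
      /* The index-to-scalar extractions become scalar loads; the indices
         are loads even for scatter stores.  */
      ops->loads += count - 1;
      /* No early return, so the stmt still gets the normal general ops
         accounting below.  */
    }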
I'm ok with this.  An alternative solution here might be doing what i386
does and scale the operation count by the number of vector subparts.
I assume the goal there was to model the latency of doing the individual
transfers in a dependency chain.  But if we instead increase the number of
general ops when the indices don't come from memory, then we simulate that
it's data bound but also account for the throughput limitation somewhat.
I'm happy to go with your original suggestion though.

> >> For the scatter store case:
> >>
> >> float
> >> s4115 (int *ip)
> >> {
> >>     for (int i = 0; i < LEN_1D; i++)
> >>       {
> >>         b[ip[i]] = a[i] + 1;
> >>       }
> >> }
> >>
> >> the vectoriser (unhelpfully) costs both the index-to-scalars and
> >> data-to-scalars as vec_to_scalar, meaning that we'll double-count
> >> the extra loads.
> >
> > I think that's more accurate though.
> >
> > This example is load Q -> umov -> store.
> >
> > This is a 3 insn dependency chain, where modelling the umov as load
> > more accurately depicts the dependency on the preceding load.
>
> For the above we generate:
>
> .L2:
>         ldr     q30, [x7, x1]
>         add     x3, x0, x1
>         ldrsw   x6, [x0, x1]
>         add     x1, x1, 16
>         ldp     w5, w4, [x3, 4]
>         add     x5, x2, w5, sxtw 2
>         add     x4, x2, w4, sxtw 2
>         fadd    v30.4s, v30.4s, v31.4s
>         ldr     w3, [x3, 12]
>         add     x3, x2, w3, sxtw 2
>         str     s30, [x2, x6, lsl 2]
>         st1     {v30.s}[1], [x5]
>         st1     {v30.s}[2], [x4]
>         st1     {v30.s}[3], [x3]
>         cmp     x1, x8
>         bne     .L2
>
> i.e. we use separate address arithmetic and avoid UMOVs.  Counting
> two loads and one store for each element of the scatter store seems
> like overkill for that.

Hmm, agreed.  How about for stores we increase the load counts by
count / 2?  This would account for the fact that we know we have indexed
stores and so the data-to-scalar operation is effectively free.  (I've put
a rough sketch of what I mean at the bottom of this mail.)

Thanks,
Tamar

>
> Thanks,
> Richard
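
P.S. The store-side tweak I mean above would be something like the below
(untested sketch, reusing the names from the hunk quoted earlier):

  if (kind == vec_to_scalar
      && (m_vec_flags & VEC_ADVSIMD)
      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
    {
      if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
        /* Emulated gather load: every index becomes a scalar load.  */
        ops->loads += count - 1;
      else
        /* Emulated scatter store: the index-to-scalars and data-to-scalars
           are both costed as vec_to_scalar, but the data side is folded
           into the indexed ST1s, so only count the index half as loads.  */
        ops->loads += count / 2;
    }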