Tamar Christina <tamar.christ...@arm.com> writes:
>> > [...]
>> > #define iterations 100000
>> > #define LEN_1D 32000
>> >
>> > float a[LEN_1D], b[LEN_1D];
>> >
>> > float
>> > s4115 (int *ip)
>> > {
>> >   float sum = 0.;
>> >   for (int i = 0; i < LEN_1D; i++)
>> >     {
>> >       sum += a[i] * b[ip[i]];
>> >     }
>> >   return sum;
>> > }
>> > [...]
>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> > index 6bb4bdf2472e62d9b066a06561da8e516f1b3c3e..cb9b155826d12b622ae0df1736e4b042d01cf56a 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -17358,6 +17358,25 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>> >        return;
>> >      }
>> >
>> > +  /* Detect the case where we are using an emulated gather/scatter.  When a
>> > +     target does not support gathers and scatters directly the vectorizer
>> > +     emulates these by constructing an index vector and then issuing an
>> > +     extraction for every lane in the vector.  This is subsequently lowered
>> > +     by veclower into a series of loads which creates the scalar indexes for
>> > +     the subsequent loads.  After the final loads are done it issues a
>> > +     vec_construct to recreate the vector from the scalar.  For costing when
>> > +     we see a vec_to_scalar on a stmt with VMAT_GATHER_SCATTER we are dealing
>> > +     with an emulated instruction and should adjust costing properly.  */
>> > +  if (kind == vec_to_scalar
>> > +      && (m_vec_flags & VEC_ADVSIMD)
>> > +      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
>> > +    {
>> > +      if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
>> > +        ops->loads += count - 1;
>> > +      else
>> > +        ops->stores += count - 1;
>>
>> ...since the aim is to replace:
>>
>> - 1 vector load for the indices
>>
>> with
>>
>> - 4 vector loads for the indices
>>
>> I think this should be changing ops->loads even for scatter stores.
>>
>
> Ah yes that's true... because the indexes for the stores are loads themselves.
>
>> But this assumes that the gather load indices come directly from memory.
>> That isn't always the case.  If we have:
>>
>> float
>> s4115 (int *ip)
>> {
>>   float sum = 0.;
>>   for (int i = 0; i < LEN_1D; i++)
>>     {
>>       sum += a[i] * b[ip[i] + 1];
>>     }
>>   return sum;
>> }
>>
>> then we'll have a vector load, a vector add, and then the 4 umovs
>> that, as you said above, were elided by post-vectoriser optimisations
>> in your b[ip[i]] example:
>>
>> .L2:
>>         ldr     q31, [x0, x1]
>>         ldr     q29, [x6, x1]
>>         add     x1, x1, 16
>>         add     v31.4s, v31.4s, v26.4s
>>         umov    w5, v31.s[1]
>>         umov    w4, v31.s[3]
>>         umov    w3, v31.s[2]
>>         fmov    w8, s31
>>         ldr     s30, [x2, w5, sxtw 2]
>>         ldr     s27, [x2, w4, sxtw 2]
>>         ldr     s31, [x2, w8, sxtw 2]
>>         ldr     s28, [x2, w3, sxtw 2]
>>         uzp1    v30.2s, v30.2s, v27.2s
>>         uzp1    v31.2s, v31.2s, v28.2s
>>         zip1    v31.4s, v31.4s, v30.4s
>>         fmla    v0.4s, v29.4s, v31.4s
>>         cmp     x1, x7
>>         bne     .L2
>>
>> These umovs are currently modelled as:
>>
>> note: load operations = 6
>> note: store operations = 0
>> note: general operations = 7
>> note: reduction latency = 2
>>
>> although I'm guessing it should be:
>>
>> note: load operations = 6
>> note: store operations = 0
>> note: general operations = 9
>> note: reduction latency = 2
>>
>> instead.  The underaccounting comes from vec_construct, which counts 1
>> rather than 3 operations for moving the scalars back to a vector.
>>
>> This patch would remove the umov accounting to give:
>>
>> note: load operations = 9
>> note: store operations = 0
>> note: general operations = 3
>> note: reduction latency = 2
>>
>> Counting loads rather than general ops wouldn't matter for tight loops
>> like these in which memory dominates anyway, since counting loads is
>> then pessimistically correct.  But in a loop that is compute bound,
>> it's probably more important to get this right.
>>
>> So I think ideally, we should try to detect whether the indices come
>> directly from memory or are the result of arithmetic.  In the former case,
>> we should do the loads adjustment above.  In the latter case, we should
>> keep the vec_to_scalar accounting unchanged.
>
> I can do this but...
>
>>
>> Of course, these umovs are likely to be more throughput-limited than we
>> model, but that's a separate pre-existing problem...
>
> I agree with the above.  The reason I just updated loads is, as you already
> said, that accounting for the umovs as general operations doesn't capture
> the bottleneck.  In general umovs are more throughput-limited than loads,
> and the number of general ops we can execute per cycle would, in the example
> above, misrepresent the throughput, since it still assumes we can execute
> all transfers plus all scalar loads in one cycle.  As the number of VX pipes
> increases, modelling them as general ops incorrectly favours the emulated
> gather.  See e.g. Cortex-X925.
>
> By still modelling them as loads we more accurately reflect that the data
> loads have to wait for the indexes.
>
> The problem with modelling them as general ops is that, when compared to
> the IFN for SVE, they end up being cheaper.  For instance the umov case
> above is faster using an actual SVE gather.
>
> So if we really want to be accurate we have to model vector transfers,
> as otherwise we still model the index transfers as effectively free.
Yeah, agree that we eventually need to model transfers properly.

But I think my point still stands that modelling loads instead of general
ops won't help in cases where memory doesn't dominate.  Modelling UMOVs as
general ops does give us something in that case, even if it's not perfect.

How about, as a compromise, just removing the early return?  That way we
won't "regress" in the counting of general ops for the case of arithmetic
indices, but will still get the benefit of the load heuristic.  (A rough,
untested sketch of what I mean is at the end of this mail.)

>> For the scatter store case:
>>
>> void
>> s4115 (int *ip)
>> {
>>   for (int i = 0; i < LEN_1D; i++)
>>     {
>>       b[ip[i]] = a[i] + 1;
>>     }
>> }
>>
>> the vectoriser (unhelpfully) costs both the index-to-scalars and
>> data-to-scalars as vec_to_scalar, meaning that we'll double-count
>> the extra loads.
>>
>
> I think that's more accurate though.
>
> This example is load Q -> umov -> store.
>
> This is a 3 insn dependency chain, where modelling the umov as a load
> more accurately depicts the dependency on the preceding load.

For the above we generate:

.L2:
        ldr     q30, [x7, x1]
        add     x3, x0, x1
        ldrsw   x6, [x0, x1]
        add     x1, x1, 16
        ldp     w5, w4, [x3, 4]
        add     x5, x2, w5, sxtw 2
        add     x4, x2, w4, sxtw 2
        fadd    v30.4s, v30.4s, v31.4s
        ldr     w3, [x3, 12]
        add     x3, x2, w3, sxtw 2
        str     s30, [x2, x6, lsl 2]
        st1     {v30.s}[1], [x5]
        st1     {v30.s}[2], [x4]
        st1     {v30.s}[3], [x3]
        cmp     x1, x8
        bne     .L2

i.e. we use separate address arithmetic and avoid UMOVs.  Counting two
loads and one store for each element of the scatter store seems like
overkill for that.

Thanks,
Richard
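P.S. To make the "remove the early return" compromise concrete, here is a
rough, untested sketch of the shape I mean.  It only reuses names from the
quoted hunk (count, kind, m_vec_flags, stmt_info, node, ops), assumes the
posted patch currently ends the new block with an early "return;" (not
visible in the quote above), and folds the load/store split into a single
loads adjustment, since the scatter-store indexes are loads too:

  /* Emulated gather/scatter on Advanced SIMD: the index vector has to be
     decomposed into scalars before the per-lane accesses can be issued,
     so account for the extra index loads.  Do this for scatter stores as
     well, since their indexes are loaded in the same way.  Deliberately
     no early return here, so that the vec_to_scalar transfers are still
     counted as general ops by the existing code below.  */
  if (kind == vec_to_scalar
      && (m_vec_flags & VEC_ADVSIMD)
      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
    ops->loads += count - 1;

Exactly where this sits relative to the existing vec_to_scalar accounting
in count_ops would need checking; the point is just that falling through
keeps the general-op count for arithmetic indices while still getting the
load heuristic.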