Tamar Christina <tamar.christ...@arm.com> writes:
>> >> So I think ideally, we should try to detect whether the indices come
>> >> directly from memory or are the result of arithmetic.  In the former
>> >> case, we should do the loads adjustment above.  In the latter case,
>> >> we should keep the vec_to_scalar accounting unchanged.
>> >
>> > I can do this but...
>> >
>> >> Of course, these umovs are likely to be more throughput-limited than
>> >> we model, but that's a separate pre-existing problem...
>> >
>> > I agree with the above.  The reason I just updated loads is, as you
>> > already said, that the umov accounting as general operations doesn't
>> > account for the bottleneck.  In general umovs are more
>> > throughput-limited than loads, and the number of general ops we can
>> > execute would in the example above misrepresent the throughput, as it
>> > still thinks it can execute all transfers + all scalar loads in one
>> > cycle.  As the number of VX increases, modelling them as general ops
>> > incorrectly favors the emulated gather.  See e.g. Cortex-X925.
>> >
>> > By still modelling them as loads, it more accurately models that the
>> > data loads have to wait for the indexes.
>> >
>> > The problem with modelling them as general ops is that, when compared
>> > to the IFN for SVE, they end up being cheaper.  For instance, the umov
>> > case above is faster using an actual SVE gather.
>> >
>> > So if we really want to be accurate, we have to model vec transfers,
>> > as otherwise it still models the index transfers as effectively free.
>>
>> Yeah, agree that we eventually need to model transfers properly.
>>
>> But I think my point still stands that modelling loads instead of
>> general ops won't help in cases where memory doesn't dominate.
>> Modelling UMOVs as general ops does give us something in that case,
>> even if it's not perfect.
>>
>> How about, as a compromise, just removing the early return?  That way
>> we won't "regress" in the counting of general ops for the case of
>> arithmetic indices, but will still get the benefit of the load
>> heuristic.
>
> I'm ok with this.  An alternative solution here might be doing what i386
> does and scaling the operation by vector subparts.  I assume the goal
> there was to model the latency of doing the individual transfers in a
> dependency chain.  But if we increase the number of general ops instead
> when the source isn't memory, then we simulate that it's data bound but
> also account for the throughput limitation somewhat.
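For concreteness, the two index cases in the quote above could be pictured
with a hypothetical pair of loops, written in the style of the s4115
example quoted further down.  Neither function is taken from the patch or
a testsuite, and the names (gather_mem_index, gather_arith_index, LEN_1D,
b) are illustrative only.

/* Hypothetical examples only; the globals mirror the s4115 example.  */
#define LEN_1D 32000
extern float b[LEN_1D];

/* Indices come directly from memory: the per-element index transfers
   can reasonably be modelled as extra scalar loads.  */
float
gather_mem_index (int *ip)
{
  float sum = 0.0f;
  for (int i = 0; i < LEN_1D; i++)
    sum += b[ip[i]];
  return sum;
}

/* Indices are the result of arithmetic: if the index computation is
   vectorised, each lane may need to be transferred to a general
   register (a UMOV on Advanced SIMD) before the dependent load.  */
float
gather_arith_index (int *ip)
{
  float sum = 0.0f;
  for (int i = 0; i < LEN_1D; i++)
    sum += b[ip[i] + i];
  return sum;
}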
Scaling by subparts feels like double-counting (well, squaring) in this
case, since the provided count of 4 already reflects the number of
subparts.

It looks like we do still vectorise for Advanced SIMD with
-mtune=neoverse-v2 (rather than -mcpu=neoverse-v2, so with SVE disabled)
and without any tuning option.  Is that the right call, or do we need to
tweak the latency costs too?

>> >> For the scatter store case:
>> >>
>> >> float
>> >> s4115 (int *ip)
>> >> {
>> >>   for (int i = 0; i < LEN_1D; i++)
>> >>     {
>> >>       b[ip[i]] = a[i] + 1;
>> >>     }
>> >> }
>> >>
>> >> the vectoriser (unhelpfully) costs both the index-to-scalars and
>> >> data-to-scalars as vec_to_scalar, meaning that we'll double-count
>> >> the extra loads.
>> >
>> > I think that's more accurate though.
>> >
>> > This example is load Q -> umov -> store.
>> >
>> > This is a 3-insn dependency chain, where modelling the umov as a load
>> > more accurately depicts the dependency on the preceding load.
>>
>> For the above we generate:
>>
>> .L2:
>>         ldr     q30, [x7, x1]
>>         add     x3, x0, x1
>>         ldrsw   x6, [x0, x1]
>>         add     x1, x1, 16
>>         ldp     w5, w4, [x3, 4]
>>         add     x5, x2, w5, sxtw 2
>>         add     x4, x2, w4, sxtw 2
>>         fadd    v30.4s, v30.4s, v31.4s
>>         ldr     w3, [x3, 12]
>>         add     x3, x2, w3, sxtw 2
>>         str     s30, [x2, x6, lsl 2]
>>         st1     {v30.s}[1], [x5]
>>         st1     {v30.s}[2], [x4]
>>         st1     {v30.s}[3], [x3]
>>         cmp     x1, x8
>>         bne     .L2
>>
>> i.e. we use separate address arithmetic and avoid UMOVs.  Counting
>> two loads and one store for each element of the scatter store seems
>> like overkill for that.
>
> Hmm, agreed...
>
> How about for stores we increase the load counts by count / 2?
>
> This would account for the fact that we know we have indexed stores
> and so the data-to-scalar operation is free?

Yeah, sounds good.  We should probably divide count itself by 2, then
apply the new count to both the load heuristic and the general ops, to
avoid double-counting in both.  (The V pipe usage for stores is modelled
as part of the scalar_store itself.)  But like you say, we should
probably drop the - 1 from the load adjustment for stores, because that
- 1 would also be applied twice.

Thanks,
Richard
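For reference, a rough, self-contained sketch of the accounting agreed
above.  The helper and parameter names are hypothetical (this is not the
actual aarch64.cc hook), and the treatment of the "- 1" reflects one
reading of the discussion: keep it for gathers, drop it for scatter
stores, and halve count for stores before applying it to both the load
heuristic and the general-op count.

/* Sketch only; hypothetical names.  COUNT is the vec_to_scalar count
   the vectoriser hands to the cost hook for an emulated gather/scatter
   and is assumed to be at least 1.  IS_STORE is nonzero for scatter
   stores.  */
static void
adjust_emulated_gather_scatter (int is_store, unsigned int count,
                                unsigned int *num_loads,
                                unsigned int *num_general_ops)
{
  if (is_store)
    /* Index-to-scalars and data-to-scalars are both costed as
       vec_to_scalar for scatter stores, so halve COUNT to avoid
       counting the index transfers twice.  */
    count /= 2;

  /* Load heuristic: the data accesses have to wait for the indexes,
     so model the index transfers as extra scalar loads.  The "- 1"
     is dropped for stores, where it would otherwise be applied
     twice.  */
  *num_loads += is_store ? count : count - 1;

  /* Keep counting the transfers as general ops as well, so that the
     arithmetic-index case is not modelled as free.  */
  *num_general_ops += count;
}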