> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Thursday, January 2, 2025 5:54 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; ktkac...@gcc.gnu.org
> Subject: Re: [PATCH]AArch64: Fix costing of emulated gathers/scatters
> [PR118188]
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> > [...]
> >> > #define iterations 100000
> >> > #define LEN_1D 32000
> >> >
> >> > float a[LEN_1D], b[LEN_1D];
> >> >
> >> > float
> >> > s4115 (int *ip)
> >> > {
> >> >     float sum = 0.;
> >> >     for (int i = 0; i < LEN_1D; i++)
> >> >         {
> >> >             sum += a[i] * b[ip[i]];
> >> >         }
> >> >     return sum;
> >> > }
> >> > [...]
> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> >> > index 6bb4bdf2472e62d9b066a06561da8e516f1b3c3e..cb9b155826d12b622ae0df1736e4b042d01cf56a 100644
> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> > @@ -17358,6 +17358,25 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
> >> >          return;
> >> >      }
> >> >
> >> > +  /* Detect the case where we are using an emulated gather/scatter.  When a
> >> > +     target does not support gathers and scatters directly the vectorizer
> >> > +     emulates these by constructing an index vector and then issuing an
> >> > +     extraction for every lane in the vector.  This is subsequently lowered
> >> > +     by veclower into a series of loads which creates the scalar indexes for
> >> > +     the subsequent loads.  After the final loads are done it issues a
> >> > +     vec_construct to recreate the vector from the scalar.  For costing when
> >> > +     we see a vec_to_scalar on a stmt with VMAT_GATHER_SCATTER we are dealing
> >> > +     with an emulated instruction and should adjust costing properly.  */
> >> > +  if (kind == vec_to_scalar
> >> > +      && (m_vec_flags & VEC_ADVSIMD)
> >> > +      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
> >> > +    {
> >> > +      if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
> >> > +        ops->loads += count - 1;
> >> > +      else
> >> > +        ops->stores += count - 1;
> >>
> >> ...since the aim is to replace:
> >>
> >> - 1 vector load for the indices
> >>
> >> with
> >>
> >> - 4 vector loads for the indices
> >>
> >> I think this should be changing ops->loads even for scatter stores.
> >>
> >
> > Ah yes, that's true... because the indexes for the stores are loads
> > themselves.
> >
> >> But this assumes that the gather load indices come directly from memory.
> >> That isn't always the case.  If we have:
> >>
> >> float
> >> s4115 (int *ip)
> >> {
> >>     float sum = 0.;
> >>     for (int i = 0; i < LEN_1D; i++)
> >>       {
> >>         sum += a[i] * b[ip[i] + 1];
> >>       }
> >>     return sum;
> >> }
> >>
> >> then we'll have a vector load, a vector add, and then the 4 umovs
> >> that, as you said above, were elided by post-vectoriser optimisations
> >> in your b[ip[i]] example:
> >>
> >> .L2:
> >>         ldr     q31, [x0, x1]
> >>         ldr     q29, [x6, x1]
> >>         add     x1, x1, 16
> >>         add     v31.4s, v31.4s, v26.4s
> >>         umov    w5, v31.s[1]
> >>         umov    w4, v31.s[3]
> >>         umov    w3, v31.s[2]
> >>         fmov    w8, s31
> >>         ldr     s30, [x2, w5, sxtw 2]
> >>         ldr     s27, [x2, w4, sxtw 2]
> >>         ldr     s31, [x2, w8, sxtw 2]
> >>         ldr     s28, [x2, w3, sxtw 2]
> >>         uzp1    v30.2s, v30.2s, v27.2s
> >>         uzp1    v31.2s, v31.2s, v28.2s
> >>         zip1    v31.4s, v31.4s, v30.4s
> >>         fmla    v0.4s, v29.4s, v31.4s
> >>         cmp     x1, x7
> >>         bne     .L2
> >>
> >> These umovs are currently modelled:
> >>
> >>   note:    load operations = 6
> >>   note:    store operations = 0
> >>   note:    general operations = 7
> >>   note:    reduction latency = 2
> >>
> >> although I'm guessing it should be:
> >>
> >>   note:    load operations = 6
> >>   note:    store operations = 0
> >>   note:    general operations = 9
> >>   note:    reduction latency = 2
> >>
> >> instead.  The underaccounting comes from vec_construct, which counts 1
> >> rather than 3 operations for moving the scalars back to a vector.
> >>
> >> This patch would remove the umov accounting to give:
> >>
> >>   note:    load operations = 9
> >>   note:    store operations = 0
> >>   note:    general operations = 3
> >>   note:    reduction latency = 2
> >>
> >> Counting loads rather than general ops wouldn't matter for tight loops
> >> like these in which memory dominates anyway, since counting loads is
> >> then pessimistically correct.  But in a loop that is compute bound,
> >> it's probably more important to get this right.
> >>
> >> So I think ideally, we should try to detect whether the indices come
> >> directly from memory or are the result of arithmetic.  In the former case,
> >> we should do the loads adjustment above.  In the latter case, we should
> >> keep the vec_to_scalar accounting unchanged.
> >
> > I can do this but...
> >
> >>
> >> Of course, these umovs are likely to be more throughput-limited than we
> >> model, but that's a separate pre-existing problem...
> >
> > I agree with the above; the reason I only updated loads is, as you already
> > said, that accounting for the umovs as general operations doesn't account
> > for the bottleneck.  In general umovs are more throughput limited than
> > loads, and the number of general ops we can execute would, in the example
> > above, misrepresent the throughput, as the model still thinks it can
> > execute all transfers + all scalar loads in one cycle.  As the number of VX
> > increases, modelling them as general ops incorrectly favors the emulated
> > gather.  See e.g. Cortex-X925.
> >
> > By still modelling them as loads it more accurately models that the data
> > loads have to wait for the indexes.
> >
> > The problem with modelling them as general ops is that, when compared to
> > the IFN for SVE, they end up being cheaper.  For instance the umov case
> > above is faster using an actual SVE gather.
> >
> > So if we really want to be accurate we have to model vec transfers, as
> > otherwise it still models the index transfers as effectively free.
> 
> Yeah, agree that we eventually need to model transfers properly.
> 
> But I think my point still stands that modelling loads instead of
> general ops won't help in cases where memory doesn't dominate.
> Modelling UMOVs as general ops does give us something in that case,
> even if it's not perfect.
> 
> How about, as a compromise, just removing the early return?  That way
> we won't "regress" in the counting of general ops for the case of
> arithmetic indices, but will still get the benefit of the load
> heuristic.

I'm ok with this.  An alternative solution here might be doing what i386
does and scaling the operation by the number of vector subparts.  I assume
the goal there was to model the latency of doing the individual transfers
in a dependency chain.  But if we increase the number of general ops
instead when the source isn't memory, then we simulate that it's data
bound but also account for the throughput limitation somewhat.
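
Roughly what I mean, as an untested sketch (ops->general_ops and the
scaling factor are just placeholders to illustrate the idea, not a
worked-out change):

  if (kind == vec_to_scalar
      && (m_vec_flags & VEC_ADVSIMD)
      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
    /* Keep the per-lane extractions as general ops, but scale them so a
       long chain of lane transfers is not modelled as (nearly) free,
       similar in spirit to the i386 handling of emulated gathers.  */
    ops->general_ops += count * 2;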

I'm happy to go with your original suggestion though.

> 
> >> For the scatter store case:
> >>
> >> float
> >> s4115 (int *ip)
> >> {
> >>   for (int i = 0; i < LEN_1D; i++)
> >>     {
> >>       b[ip[i]] = a[i] + 1;
> >>     }
> >> }
> >>
> >> the vectoriser (unhelpfully) costs both the index-to-scalars and
> >> data-to-scalars as vec_to_scalar, meaning that we'll double-count
> >> the extra loads.
> >>
> >
> > I think that's more accurate though.
> >
> > This example is load Q -> umov -> store.
> >
> > This is a 3-insn dependency chain, where modelling the umov as a load
> > more accurately depicts the dependency on the preceding load.
> 
> For the above we generate:
> 
> .L2:
>         ldr     q30, [x7, x1]
>         add     x3, x0, x1
>         ldrsw   x6, [x0, x1]
>         add     x1, x1, 16
>         ldp     w5, w4, [x3, 4]
>         add     x5, x2, w5, sxtw 2
>         add     x4, x2, w4, sxtw 2
>         fadd    v30.4s, v30.4s, v31.4s
>         ldr     w3, [x3, 12]
>         add     x3, x2, w3, sxtw 2
>         str     s30, [x2, x6, lsl 2]
>         st1     {v30.s}[1], [x5]
>         st1     {v30.s}[2], [x4]
>         st1     {v30.s}[3], [x3]
>         cmp     x1, x8
>         bne     .L2
> 
> i.e. we use separate address arithmetic and avoid UMOVs.  Counting
> two loads and one store for each element of the scatter store seems
> like overkill for that.

Hmm, agreed.

How about, for stores, we increase the load count by count / 2?

That would account for the fact that we know we have indexed stores,
and so the data-to-scalar operation is effectively free.
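
I.e., as an untested sketch on top of the hunk above:

      if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
        ops->loads += count - 1;
      else
        /* Scatter stores get a vec_to_scalar for both the index-to-scalar
           and the data-to-scalar halves, but only the index half turns
           into extra loads; the data half is folded into the indexed
           ST1s.  */
        ops->loads += count / 2;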

Thanks,
Tamar

> 
> Thanks,
> Richard
