> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Tuesday, August 27, 2024 11:46 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo
> Tkachov <ktkac...@exchange.nvidia.com>
> Subject: Re: [RFC][PATCH] AArch64: Remove
> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> > Hi Jennifer,
> >
> >> -----Original Message-----
> >> From: Jennifer Schmitz <jschm...@nvidia.com>
> >> Sent: Friday, August 23, 2024 1:07 PM
> >> To: gcc-patches@gcc.gnu.org
> >> Cc: Richard Sandiford <richard.sandif...@arm.com>; Kyrylo Tkachov
> >> <ktkac...@exchange.nvidia.com>
> >> Subject: [RFC][PATCH] AArch64: Remove
> >> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>
> >> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >> tunable and
> >> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
> >> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
> >> default.
> 
> Thanks for doing this.  This has been on my TODO list ever since the
> tunable was added.
> 
> The history is that these "new" costs were originally added in stage 4
> of GCC 11 for Neoverse V1.  Since the costs were added so late, it wasn't
> appropriate to change the behaviour for any other core.  All the new code
> was therefore gated on this option.
> 
> The new costs do two main things:
> 
> (1) use throughput-based calculations where available, including to choose
>     between Advanced SIMD and SVE
> 
> (2) try to make the latency-based costs more precise, by looking more closely
>     at the provided stmt_info
> 
> Old cost models won't be affected by (1) either way, since they don't
> provide any throughput information.  But they should in principle benefit
> from (2).  So...
> 
> >> To that end, the function aarch64_use_new_vector_costs_p and its uses were
> >> removed. Additionally, guards were added to prevent null-pointer
> >> dereferences of fields in cpu_vector_cost.
> >>
> >
> > I'm not against this change, but it does mean that we now switch the old
> > Adv. SIMD cost models to the new throughput-based cost models as well.
> > That means that -mcpu=generic now behaves differently, as does
> > -mcpu=neoverse-n1, and I think some distros explicitly use this
> > (I believe Yocto, for instance, does).
> 
> ...it shouldn't mean that we start using throughput-based models for
> cortexa53 etc., since there's no associated issue info.
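> As a rough illustration of that split (a toy sketch only; these are not
> the actual GCC structures or field names): the throughput-based estimate
> is only available when the tuning supplies issue info, so tunings
> without it keep getting a purely latency-based figure.
> 
>   struct toy_issue_info { unsigned ops_per_cycle; };
> 
>   struct toy_vector_cost
>   {
>     unsigned latency_per_stmt;        /* latency units per statement  */
>     const toy_issue_info *issue_info; /* null for the old cost models  */
>   };
> 
>   /* Toy estimate of the cycles taken by NUM_STMTS vector statements.
>      With issue info we can model how fast the statements can actually
>      be issued (which is what lets us compare e.g. Advanced SIMD and
>      SVE candidates); without it, all we can do is add up latencies.  */
>   unsigned
>   toy_estimate_cycles (const toy_vector_cost &costs, unsigned num_stmts)
>   {
>     if (costs.issue_info)
>       return (num_stmts + costs.issue_info->ops_per_cycle - 1)
>              / costs.issue_info->ops_per_cycle;
>     return num_stmts * costs.latency_per_stmt;
>   }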

Yes, I was using "throughput-based model" as a name.  But as you indicated in (2),
it does change the latency calculation.

My question was because of things in e.g. aarch64_adjust_stmt_cost and friends;
e.g. aarch64_multiply_add_p changes the cost between FMA SIMD and scalar.

So my question...

> 
> > Have we validated that the old generic cost model still behaves sensibly
> > with this change?

is still valid, I think: we *are* changing the cost for all models,
and while they should indeed be more accurate, there could be knock-on effects.
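
To illustrate the kind of adjustment I mean (a toy sketch only, nothing
like the real aarch64_multiply_add_p/aarch64_adjust_stmt_cost code): a
multiply whose only use is an addition gets treated as fused with it, so
its own latency disappears; with the tunable gone, that kind of tweak now
applies to every cost model, old and new.

  #include <vector>

  /* Toy statement record: just enough to show the idea.  */
  struct toy_stmt
  {
    bool is_mult;
    bool only_use_is_add;  /* for multiplies: result feeds exactly one add  */
    unsigned base_cost;    /* latency of the statement on its own  */
  };

  /* Toy latency sum: a multiply that feeds a single add is assumed to be
     fused into an FMA, so it contributes no latency of its own.  */
  unsigned
  toy_body_latency (const std::vector<toy_stmt> &stmts)
  {
    unsigned cost = 0;
    for (const toy_stmt &s : stmts)
      {
        if (s.is_mult && s.only_use_is_add)
          continue;        /* costed as part of the consuming add  */
        cost += s.base_cost;
      }
    return cost;
  }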

Thanks,
Tamar

> >
> >> The patch was bootstrapped and regtested on aarch64-linux-gnu:
> >> No problems bootstrapping, but several test files (in aarch64-sve.exp:
> >> gather_load_extend_X.c where X is 1 to 4, strided_load_2.c,
> >> strided_store_2.c) fail because of small differences in codegen that
> >> make some of the scan-assembler-times tests fail.
> >>
> >> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these tests
> >> and add some
> >
> > I don't personally like unlimited here, as unlimited means just vectorize
> > at any cost.  This means that costing between modes is also disabled.
> > A lot of these testcases are intended to test not just that we vectorize,
> > but that we vectorize with efficient code.
> >
> > I'd prefer to use -fvect-cost-model=dynamic if that fixes the testcases.
> 
> Yeah, I don't think -fvect-cost-model=unlimited would work for the
> gather_load_extend_X.c tests, since we require the cost model to decide
> when to use extending loads vs loading a full vector and unpacking.
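> (For reference, the essence of those tests is a widening gather along
> the lines of the sketch below, simplified and not the actual testcase,
> where the interesting decision is exactly extending load vs. full-width
> load plus unpacks.)
> 
>   #include <stdint.h>
> 
>   /* Simplified sketch of the gather_load_extend pattern: a byte gather
>      whose result is widened to 64 bits.  */
>   void
>   toy_gather_extend (int64_t *dst, const int8_t *src,
>                      const int64_t *index, int n)
>   {
>     for (int i = 0; i < n; ++i)
>       dst[i] = src[index[i]];
>   }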
> 
> [...tries patch...]
> 
> It seems like three main things are contributing to the difference:
> 
> 1. we now cost 0 for a scalar prologue extension from a loaded value
> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost
> 3. we now apply a large one-off cost for gathers (Tamar's recent change)
> 
> (1) is expected.
> 
> (2) partly looks like a latent bug.  We're using the x32 cost for
> VNx2QI->VNx2SI, even though those are really .B->.D loads.
> 
> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>        && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
>      {
>        unsigned int nunits = vect_nunits_for_cost (vectype);
> -      if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
> +      if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), aarch64_sve_vg))
>       return { sve_costs->gather_load_x64_cost, nunits };
>        return { sve_costs->gather_load_x32_cost, nunits };
>      }
> 
> fixes that.
> 
> (3) is interesting.  generic_vector_cost isn't used by default for any
> SVE CPU, or any -march=...+sve.  So the question is: should we treat it
> as "architectural intent"/somewhat idealised?  Or should we try to make
> it produce good code for existing SVE cores, in which case it would
> overlap quite a lot with generic_armv8_a and generic_armv9_a.
> 
> If the former, we could set gather_load_x32_init_cost and
> gather_load_x64_init_cost to 0 for generic_sve_vector_cost
> (and nothing else, so that generic_armv*_a are unaffected).
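> Concretely, that would just be zeroing those two entries in
> generic_sve_vector_cost, i.e. something like (sketch only, with the
> surrounding fields of the initializer elided):
> 
>   ...
>   0, /* gather_load_x32_init_cost  */
>   0, /* gather_load_x64_init_cost  */
>   ...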
> 
> On the patch:
> 
> > @@ -16733,7 +16723,8 @@ aarch64_in_loop_reduction_latency (vec_info *vinfo, stmt_vec_info stmt_info,
> >  {
> >    const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
> >    const sve_vec_cost *sve_costs = nullptr;
> > -  if (vec_flags & VEC_ANY_SVE)
> > +  if (vec_costs->sve
> > +      && vec_flags & VEC_ANY_SVE)
> >      sve_costs = aarch64_tune_params.vec_costs->sve;
> 
> This doesn't seem necessary.  I agree we might as well reuse the
> vec_costs local variable in the "if" body though.
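> I.e. keep the original condition and just use the existing local,
> something like:
> 
>   const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>   const sve_vec_cost *sve_costs = nullptr;
>   if (vec_flags & VEC_ANY_SVE)
>     sve_costs = vec_costs->sve;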
> 
> > [...]
> > @@ -16794,9 +16785,10 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
> >                                      fractional_cost stmt_cost)
> >  {
> >    const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
> > +  const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
> >    const sve_vec_cost *sve_costs = nullptr;
> > -  if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
> > -    sve_costs = aarch64_tune_params.vec_costs->sve;
> > +  if (aarch64_sve_mode_p (TYPE_MODE (vectype)) && vec_costs->sve)
> > +    sve_costs = vec_costs->sve;
> 
> Similarly here, this doesn't seem necessary.
> 
> > [...]
> > @@ -17301,11 +17293,13 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >                                                     where, stmt_cost);
> >
> >        /* Check if we've seen an SVE gather/scatter operation and which size.  */
> > +      const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
> >        if (kind == scalar_load
> >           && aarch64_sve_mode_p (TYPE_MODE (vectype))
> > -         && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> > +         && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER
> > +         && vec_costs->sve)
> >         {
> > -         const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
> > +         const sve_vec_cost *sve_costs = vec_costs->sve;
> >           if (sve_costs)
> >             {
> >               if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
> 
> Here too.
> 
> Thanks,
> Richard
