> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Tuesday, August 27, 2024 11:46 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo Tkachov <ktkac...@exchange.nvidia.com>
> Subject: Re: [RFC][PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> > Hi Jennifer,
> >
> >> -----Original Message-----
> >> From: Jennifer Schmitz <jschm...@nvidia.com>
> >> Sent: Friday, August 23, 2024 1:07 PM
> >> To: gcc-patches@gcc.gnu.org
> >> Cc: Richard Sandiford <richard.sandif...@arm.com>; Kyrylo Tkachov <ktkac...@exchange.nvidia.com>
> >> Subject: [RFC][PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>
> >> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable
> >> and use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
> >> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the default.
>
> Thanks for doing this. This has been on my TODO list ever since the
> tunable was added.
>
> The history is that these "new" costs were originally added in stage 4
> of GCC 11 for Neoverse V1. Since the costs were added so late, it wasn't
> appropriate to change the behaviour for any other core. All the new code
> was therefore gated on this option.
>
> The new costs do two main things:
>
> (1) use throughput-based calculations where available, including to choose
>     between Advanced SIMD and SVE
>
> (2) try to make the latency-based costs more precise, by looking more
>     closely at the provided stmt_info
>
> Old cost models won't be affected by (1) either way, since they don't
> provide any throughput information. But they should in principle benefit
> from (2). So...
>
> >> To that end, the function aarch64_use_new_vector_costs_p and its uses
> >> were removed. Additionally, guards were added to prevent null-pointer
> >> dereferences of fields in cpu_vector_cost.
> >>
> >
> > I'm not against this change, but it does mean that we now switch old
> > Adv. SIMD cost models as well to the new throughput-based cost models.
> > That means that -mcpu=generic now behaves differently, and so does
> > -mcpu=neoverse-n1, and I think some distros explicitly use this
> > (I believe yocto for instance does).
>
> ...it shouldn't mean that we start using throughput-based models for
> cortexa53 etc., since there's no associated issue info.

Yes, I was using "throughput-based model" as a loose name. But as you
indicated in (2), it does change the latency calculation as well. My
question was prompted by things in e.g. aarch64_adjust_stmt_cost and
friends: e.g. aarch64_multiply_add_p changes how an FMA is costed for
SIMD vs. scalar.
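
To make that concrete, this is the kind of loop I have in mind (a made-up
illustration, not a testcase from the patch or the testsuite):

  /* Illustrative only: a loop we would normally expect to vectorize to an
     fmla.  Whether the multiply is costed on its own or treated as fused
     into the multiply-add shifts the vector vs. scalar latency comparison,
     even for cost models that provide no issue (throughput) info.  */
  void
  saxpy (int n, float a, const float *restrict x, float *restrict y)
  {
    for (int i = 0; i < n; i++)
      y[i] += a * x[i];
  }

Nothing wrong with that in principle, but it's exactly the kind of place
where the old cost models can now reach a different conclusion than before.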

So my question:

> > Have we validated that the old generic cost model still behaves sensibly
> > with this change?

is still valid, I think: we *are* changing the cost for all models, and
while they should indeed be more accurate, there could be knock-on effects.

Thanks,
Tamar

> >> The patch was bootstrapped and regtested on aarch64-linux-gnu:
> >> No problems bootstrapping, but several test files (in aarch64-sve.exp:
> >> gather_load_extend_X.c where X is 1 to 4, strided_load_2.c,
> >> strided_store_2.c) fail because of small differences in codegen that
> >> make some of the scan-assembler-times tests fail.
> >>
> >> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these
> >> tests and add some
> >
> > I don't personally like unlimited here as unlimited means just vectorize
> > at any cost. This means that costing between modes is also disabled. A
> > lot of these testcases are intended to test not just that we vectorize
> > but that we vectorize with efficient code.
> >
> > I'd prefer to use -fvect-cost-model=dynamic if that fixes the testcases.
>
> Yeah, I don't think -fvect-cost-model=unlimited would work for the
> gather_load_extend_X.c tests, since we require the cost model to decide
> when to use extending loads vs loading a full vector and unpacking.
>
> [...tries patch...]
>
> It seems like three main things are contributing to the difference:
>
> 1. we now cost 0 for a scalar prologue extension from a loaded value
> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost
> 3. we now apply a large one-off cost for gathers (Tamar's recent change)
>
> (1) is expected.
>
> (2) partly looks like a latent bug. We're using the x32 cost for
> VNx2QI->VNx2SI, even though those are really .B->.D loads.
>
> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>        && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
>      {
>        unsigned int nunits = vect_nunits_for_cost (vectype);
> -      if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
> +      if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), aarch64_sve_vg))
>       return { sve_costs->gather_load_x64_cost, nunits };
>        return { sve_costs->gather_load_x32_cost, nunits };
>      }
>
> fixes that.
>
> (3) is interesting. generic_vector_cost isn't used by default for any
> SVE CPU, or any -march=...+sve. So the question is: should we treat it
> as "architectural intent"/somewhat idealised? Or should we try to make
> it produce good code for existing SVE cores, in which case it would
> overlap quite a lot with generic_armv8_a and generic_armv9_a.
>
> If the former, we could set gather_load_x32_init_cost and
> gather_load_x64_init_cost to 0 for generic_sve_vector_cost
> (and nothing else, so that generic_armv*_a are unaffected).
>
> On the patch:
>
> > @@ -16733,7 +16723,8 @@ aarch64_in_loop_reduction_latency (vec_info *vinfo, stmt_vec_info stmt_info,
> >  {
> >    const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
> >    const sve_vec_cost *sve_costs = nullptr;
> > -  if (vec_flags & VEC_ANY_SVE)
> > +  if (vec_costs->sve
> > +      && vec_flags & VEC_ANY_SVE)
> >      sve_costs = aarch64_tune_params.vec_costs->sve;
>
> This doesn't seem necessary. I agree we might as well reuse the
> vec_costs local variable in the "if" body though.
>
> > [...]
> > @@ -16794,9 +16785,10 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
> >                                      fractional_cost stmt_cost)
> >  {
> >    const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
> > +  const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
> >    const sve_vec_cost *sve_costs = nullptr;
> > -  if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
> > -    sve_costs = aarch64_tune_params.vec_costs->sve;
> > +  if (aarch64_sve_mode_p (TYPE_MODE (vectype)) && vec_costs->sve)
> > +    sve_costs = vec_costs->sve;
>
> Similarly here, this doesn't seem necessary.
>
> > [...]
> > @@ -17301,11 +17293,13 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >                                        where, stmt_cost);
> >
> >    /* Check if we've seen an SVE gather/scatter operation and which size.  */
> > +  const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
> >    if (kind == scalar_load
> >        && aarch64_sve_mode_p (TYPE_MODE (vectype))
> > -      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> > +      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER
> > +      && vec_costs->sve)
> >      {
> > -      const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
> > +      const sve_vec_cost *sve_costs = vec_costs->sve;
> >        if (sve_costs)
> >     {
> >       if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>
> Here too.
>
> Thanks,
> Richard