On Wed, Dec 18, 2024 at 6:30 PM Jennifer Schmitz <jschm...@nvidia.com> wrote: > > > > > On 17 Dec 2024, at 18:57, Richard Biener <rguent...@suse.de> wrote: > > > > External email: Use caution opening links or attachments > > > > > >> Am 16.12.2024 um 09:10 schrieb Jennifer Schmitz <jschm...@nvidia.com>: > >> > >> > >> > >>> On 14 Dec 2024, at 09:32, Richard Biener <rguent...@suse.de> wrote: > >>> > >>> External email: Use caution opening links or attachments > >>> > >>> > >>>>> Am 13.12.2024 um 18:00 schrieb Jennifer Schmitz <jschm...@nvidia.com>: > >>>> > >>>> > >>>> > >>>>> On 13 Dec 2024, at 13:40, Richard Biener <richard.guent...@gmail.com> > >>>>> wrote: > >>>>> > >>>>> External email: Use caution opening links or attachments > >>>>> > >>>>> > >>>>>> On Thu, Dec 12, 2024 at 5:27 PM Jennifer Schmitz <jschm...@nvidia.com> > >>>>>> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>>> On 6 Dec 2024, at 08:41, Jennifer Schmitz <jschm...@nvidia.com> wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>> On 5 Dec 2024, at 20:07, Richard Sandiford > >>>>>>>> <richard.sandif...@arm.com> wrote: > >>>>>>>> > >>>>>>>> External email: Use caution opening links or attachments > >>>>>>>> > >>>>>>>> > >>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes: > >>>>>>>>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote: > >>>>>>>>>> > >>>>>>>>>> External email: Use caution opening links or attachments > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote: > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford > >>>>>>>>>>>> <richard.sandif...@arm.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> External email: Use caution opening links or attachments > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes: > >>>>>>>>>>>>> [...] > >>>>>>>>>>>>> Looking at the diff of the vect dumps (below is a section of > >>>>>>>>>>>>> the diff for strided_store_2.c), it seemed odd that > >>>>>>>>>>>>> vec_to_scalar operations cost 0 now, instead of the previous > >>>>>>>>>>>>> cost of 2: > >>>>>>>>>>>>> > >>>>>>>>>>>>> +strided_store_1.c:38:151: note: === vectorizable_operation > >>>>>>>>>>>>> === > >>>>>>>>>>>>> +strided_store_1.c:38:151: note: vect_model_simple_cost: > >>>>>>>>>>>>> inside_cost = 1, prologue_cost = 0 . > >>>>>>>>>>>>> +strided_store_1.c:38:151: note: ==> examining statement: *_6 > >>>>>>>>>>>>> = _7; > >>>>>>>>>>>>> +strided_store_1.c:38:151: note: vect_is_simple_use: operand > >>>>>>>>>>>>> _3 + 1.0e+0, type of def: internal > >>>>>>>>>>>>> +strided_store_1.c:38:151: note: Vectorizing an unaligned > >>>>>>>>>>>>> access. > >>>>>>>>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128 > >>>>>>>>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234 > >>>>>>>>>>>>> +strided_store_1.c:38:151: note: vect_model_store_cost: > >>>>>>>>>>>>> inside_cost = 12, prologue_cost = 0 . 
> >>>>>>>>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body > >>>>>>>>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue > >>>>>>>>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body > >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body > >>>>>>>>>>>>> +<unknown> 1 times vector_load costs 1 in prologue > >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body > >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body > >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body > >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body > >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body > >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body > >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body > >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body > >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body > >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body > >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body > >>>>>>>>>>>>> > >>>>>>>>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in > >>>>>>>>>>>>> multiple places in aarch64.cc, the location that causes this > >>>>>>>>>>>>> behavior is this one: > >>>>>>>>>>>>> unsigned > >>>>>>>>>>>>> aarch64_vector_costs::add_stmt_cost (int count, > >>>>>>>>>>>>> vect_cost_for_stmt kind, > >>>>>>>>>>>>> stmt_vec_info stmt_info, slp_tree, > >>>>>>>>>>>>> tree vectype, int misalign, > >>>>>>>>>>>>> vect_cost_model_location where) > >>>>>>>>>>>>> { > >>>>>>>>>>>>> [...] > >>>>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO > >>>>>>>>>>>>> instead > >>>>>>>>>>>>> of just looking at KIND. */ > >>>>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) > >>>>>>>>>>>>> + if (stmt_info) > >>>>>>>>>>>>> { > >>>>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one > >>>>>>>>>>>>> vec_to_scalar for each element. However, we can store the first > >>>>>>>>>>>>> element using an FP store without a separate extract step. */ > >>>>>>>>>>>>> if (vect_is_store_elt_extraction (kind, stmt_info)) > >>>>>>>>>>>>> count -= 1; > >>>>>>>>>>>>> > >>>>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind, > >>>>>>>>>>>>> stmt_info, > >>>>>>>>>>>>> stmt_cost); > >>>>>>>>>>>>> > >>>>>>>>>>>>> if (vectype && m_vec_flags) > >>>>>>>>>>>>> stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind, > >>>>>>>>>>>>> stmt_info, > >>>>>>>>>>>>> vectype, > >>>>>>>>>>>>> where, > >>>>>>>>>>>>> stmt_cost); > >>>>>>>>>>>>> } > >>>>>>>>>>>>> [...] > >>>>>>>>>>>>> return record_stmt_cost (stmt_info, where, (count * > >>>>>>>>>>>>> stmt_cost).ceil ()); > >>>>>>>>>>>>> } > >>>>>>>>>>>>> > >>>>>>>>>>>>> Previously, for mtune=generic, this function returned a cost of > >>>>>>>>>>>>> 2 for a vec_to_scalar operation in the vect body. Now "if > >>>>>>>>>>>>> (stmt_info)" is entered and "if (vect_is_store_elt_extraction > >>>>>>>>>>>>> (kind, stmt_info))" evaluates to true, which sets the count to > >>>>>>>>>>>>> 0 and leads to a return value of 0. > >>>>>>>>>>>> > >>>>>>>>>>>> At the time the code was written, a scalarised store would be > >>>>>>>>>>>> costed > >>>>>>>>>>>> using one vec_to_scalar call into the backend, with the count > >>>>>>>>>>>> parameter > >>>>>>>>>>>> set to the number of elements being stored. The "count -= 1" was > >>>>>>>>>>>> supposed to lop off the leading element extraction, since we can > >>>>>>>>>>>> store > >>>>>>>>>>>> lane 0 as a normal FP store. 
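To see the arithmetic behind those zero costs, take the numbers from the dump above: a generic vec_to_scalar cost of 2 and a group scalarised into four stores (the figures below are an illustration of the two costing schemes, not compiler output):

    one backend call for the whole group (as originally assumed):  count = 4  ->  (4 - 1) * 2 = 6
    one backend call per element (after the rework quoted below):  count = 1  ->  (1 - 1) * 2 = 0, four times over

So the "count -= 1" that was meant to drop only the lane-0 extraction now zeroes every vec_to_scalar in the group.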
> >>>>>>>>>>>> > >>>>>>>>>>>> The target-independent costing was later reworked so that it > >>>>>>>>>>>> costs > >>>>>>>>>>>> each operation individually: > >>>>>>>>>>>> > >>>>>>>>>>>> for (i = 0; i < nstores; i++) > >>>>>>>>>>>> { > >>>>>>>>>>>> if (costing_p) > >>>>>>>>>>>> { > >>>>>>>>>>>> /* Only need vector extracting when there are more > >>>>>>>>>>>> than one stores. */ > >>>>>>>>>>>> if (nstores > 1) > >>>>>>>>>>>> inside_cost > >>>>>>>>>>>> += record_stmt_cost (cost_vec, 1, vec_to_scalar, > >>>>>>>>>>>> stmt_info, 0, vect_body); > >>>>>>>>>>>> /* Take a single lane vector type store as scalar > >>>>>>>>>>>> store to avoid ICE like 110776. */ > >>>>>>>>>>>> if (VECTOR_TYPE_P (ltype) > >>>>>>>>>>>> && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) > >>>>>>>>>>>> n_adjacent_stores++; > >>>>>>>>>>>> else > >>>>>>>>>>>> inside_cost > >>>>>>>>>>>> += record_stmt_cost (cost_vec, 1, scalar_store, > >>>>>>>>>>>> stmt_info, 0, vect_body); > >>>>>>>>>>>> continue; > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> Unfortunately, there's no easy way of telling whether a > >>>>>>>>>>>> particular call > >>>>>>>>>>>> is part of a group, and if so, which member of the group it is. > >>>>>>>>>>>> > >>>>>>>>>>>> I suppose we could give up on the attempt to be (somewhat) > >>>>>>>>>>>> accurate > >>>>>>>>>>>> and just disable the optimisation. Or we could restrict it to > >>>>>>>>>>>> count > 1, > >>>>>>>>>>>> since it might still be useful for gathers and scatters. > >>>>>>>>>>> I tried restricting the calls to vect_is_store_elt_extraction to > >>>>>>>>>>> count > 1 and it seems to resolve the issue of costing > >>>>>>>>>>> vec_to_scalar operations with 0 (see patch below). > >>>>>>>>>>> What are your thoughts on this? > >>>>>>>>>> > >>>>>>>>>> Why didn't you pursue instead moving the vec_to_scalar cost > >>>>>>>>>> together > >>>>>>>>>> with the n_adjacent_store handling? > >>>>>>>>> When I continued working on this patch, we had already reached > >>>>>>>>> stage 3 and I was hesitant to introduce changes to the middle-end > >>>>>>>>> that were not previously covered by this patch. So I tried if the > >>>>>>>>> issue could not be resolved by making a small change in the backend. > >>>>>>>>> If you still advise to use the n_adjacent_store instead, I’m happy > >>>>>>>>> to look into it again. > >>>>>>>> > >>>>>>>> If Richard's ok with adjusting vectorizable_store for GCC 15 (which > >>>>>>>> it > >>>>>>>> sounds like he is), then I agree that would be better. Otherwise > >>>>>>>> we'd > >>>>>>>> be creating technical debt to clean up for GCC 16. And it is a > >>>>>>>> regression > >>>>>>>> of sorts, so is stage 3 material from that POV. > >>>>>>>> > >>>>>>>> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a > >>>>>>>> "let's clean this up next stage 1" thing, since we needed to add > >>>>>>>> tuning > >>>>>>>> for a new CPU late during the cycle. But of course, there were other > >>>>>>>> priorities when stage 1 actually came around, so it never actually > >>>>>>>> happened. Thanks again for being the one to sort this out.) > >>>>>>> Thanks for your feedback. Then I will try to make it work in > >>>>>>> vectorizable_store. > >>>>>>> Best, > >>>>>>> Jennifer > >>>>>> Below is the updated patch with a suggestion for the changes in > >>>>>> vectorizable_store. It resolves the issue with the vec_to_scalar > >>>>>> operations that were individually costed with 0. > >>>>>> We already tested it on aarch64, no regression, but we are still doing > >>>>>> performance testing. 
> >>>>>> Can you give some feedback in the meantime on the patch itself? > >>>>>> Thanks, > >>>>>> Jennifer > >>>>>> > >>>>>> > >>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable > >>>>>> and > >>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the > >>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the > >>>>>> default. To that end, the function aarch64_use_new_vector_costs_p and > >>>>>> its uses > >>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as > >>>>>> described in > >>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, > >>>>>> we adjusted vectorizable_store such that the variable n_adjacent_stores > >>>>>> also covers vec_to_scalar operations. This way vec_to_scalar operations > >>>>>> are not costed individually, but as a group. > >>>>>> > >>>>>> Two tests were adjusted due to changes in codegen. In both cases, the > >>>>>> old code performed loop unrolling once, but the new code does not: > >>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with > >>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic > >>>>>> -moverride=tune=none): > >>>>>> f_int64_t_32: > >>>>>> cbz w3, .L92 > >>>>>> mov x4, 0 > >>>>>> uxtw x3, w3 > >>>>>> + cntd x5 > >>>>>> + whilelo p7.d, xzr, x3 > >>>>>> + mov z29.s, w5 > >>>>>> mov z31.s, w2 > >>>>>> - whilelo p6.d, xzr, x3 > >>>>>> - mov x2, x3 > >>>>>> - index z30.s, #0, #1 > >>>>>> - uqdecd x2 > >>>>>> - ptrue p5.b, all > >>>>>> - whilelo p7.d, xzr, x2 > >>>>>> + index z30.d, #0, #1 > >>>>>> + ptrue p6.b, all > >>>>>> .p2align 3,,7 > >>>>>> .L94: > >>>>>> - ld1d z27.d, p7/z, [x0, #1, mul vl] > >>>>>> - ld1d z28.d, p6/z, [x0] > >>>>>> - movprfx z29, z31 > >>>>>> - mul z29.s, p5/m, z29.s, z30.s > >>>>>> - incw x4 > >>>>>> - uunpklo z0.d, z29.s > >>>>>> - uunpkhi z29.d, z29.s > >>>>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] > >>>>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] > >>>>>> - add z25.d, z28.d, z25.d > >>>>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3] > >>>>>> + movprfx z28, z31 > >>>>>> + mul z28.s, p6/m, z28.s, z30.s > >>>>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] > >>>>>> add z26.d, z27.d, z26.d > >>>>>> - st1d z26.d, p7, [x0, #1, mul vl] > >>>>>> - whilelo p7.d, x4, x2 > >>>>>> - st1d z25.d, p6, [x0] > >>>>>> - incw z30.s > >>>>>> - incb x0, all, mul #2 > >>>>>> - whilelo p6.d, x4, x3 > >>>>>> + st1d z26.d, p7, [x0, x4, lsl 3] > >>>>>> + add z30.s, z30.s, z29.s > >>>>>> + incd x4 > >>>>>> + whilelo p7.d, x4, x3 > >>>>>> b.any .L94 > >>>>>> .L92: > >>>>>> ret > >>>>>> > >>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with > >>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic > >>>>>> -moverride=tune=none): > >>>>>> f_int64_t_32: > >>>>>> cbz w3, .L84 > >>>>>> - addvl x5, x1, #1 > >>>>>> mov x4, 0 > >>>>>> uxtw x3, w3 > >>>>>> - mov z31.s, w2 > >>>>>> + cntd x5 > >>>>>> whilelo p7.d, xzr, x3 > >>>>>> - mov x2, x3 > >>>>>> - index z30.s, #0, #1 > >>>>>> - uqdecd x2 > >>>>>> - ptrue p5.b, all > >>>>>> - whilelo p6.d, xzr, x2 > >>>>>> + mov z29.s, w5 > >>>>>> + mov z31.s, w2 > >>>>>> + index z30.d, #0, #1 > >>>>>> + ptrue p6.b, all > >>>>>> .p2align 3,,7 > >>>>>> .L86: > >>>>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3] > >>>>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3] > >>>>>> - movprfx z29, z30 > >>>>>> - mul z29.s, p5/m, z29.s, z31.s > >>>>>> - add z28.d, z28.d, #1 > >>>>>> - uunpklo z26.d, z29.s > >>>>>> - st1d z28.d, p7, [x0, z26.d, lsl 3] > >>>>>> - incw x4 > >>>>>> - uunpkhi z29.d, z29.s > 
>>>>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3] > >>>>>> + movprfx z28, z30 > >>>>>> + mul z28.s, p6/m, z28.s, z31.s > >>>>>> add z27.d, z27.d, #1 > >>>>>> - whilelo p6.d, x4, x2 > >>>>>> - st1d z27.d, p7, [x0, z29.d, lsl 3] > >>>>>> - incw z30.s > >>>>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3] > >>>>>> + incd x4 > >>>>>> + add z30.s, z30.s, z29.s > >>>>>> whilelo p7.d, x4, x3 > >>>>>> b.any .L86 > >>>>>> .L84: > >>>>>> ret > >>>>>> > >>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no > >>>>>> regression. > >>>>>> OK for mainline? > >>>>>> > >>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> > >>>>>> > >>>>>> gcc/ > >>>>>> * tree-vect-stmts.cc (vectorizable_store): Extend the use of > >>>>>> n_adjacent_stores to also cover vec_to_scalar operations. > >>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove > >>>>>> use_new_vector_costs as tuning option. > >>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): > >>>>>> Remove. > >>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of > >>>>>> aarch64_use_new_vector_costs_p. > >>>>>> (aarch64_vector_costs::finish_cost): Remove use of > >>>>>> aarch64_use_new_vector_costs_p. > >>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove > >>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. > >>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. > >>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. > >>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. > >>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. > >>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise. > >>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise. > >>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise. > >>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise. > >>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise. > >>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. > >>>>>> > >>>>>> gcc/testsuite/ > >>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. > >>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. 
> >>>>>> --- > >>>>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 -- > >>>>>> gcc/config/aarch64/aarch64.cc | 20 +++---------- > >>>>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - > >>>>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - > >>>>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 - > >>>>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 - > >>>>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 - > >>>>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - > >>>>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - > >>>>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - > >>>>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - > >>>>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - > >>>>>> .../aarch64/tuning_models/neoversev3ae.h | 1 - > >>>>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- > >>>>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- > >>>>>> gcc/tree-vect-stmts.cc | 29 ++++++++++--------- > >>>>>> 16 files changed, 22 insertions(+), 44 deletions(-) > >>>>>> > >>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def > >>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def > >>>>>> index ffbff20e29c..1de633c739b 100644 > >>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def > >>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def > >>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", > >>>>>> CHEAP_SHIFT_EXTEND) > >>>>>> > >>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", > >>>>>> CSE_SVE_VL_CONSTANTS) > >>>>>> > >>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", > >>>>>> USE_NEW_VECTOR_COSTS) > >>>>>> - > >>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", > >>>>>> MATCHED_VECTOR_THROUGHPUT) > >>>>>> > >>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", > >>>>>> AVOID_CROSS_LOOP_FMA) > >>>>>> diff --git a/gcc/config/aarch64/aarch64.cc > >>>>>> b/gcc/config/aarch64/aarch64.cc > >>>>>> index 77a2a6bfa3a..71fba9cc63b 100644 > >>>>>> --- a/gcc/config/aarch64/aarch64.cc > >>>>>> +++ b/gcc/config/aarch64/aarch64.cc > >>>>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info > >>>>>> *vinfo, bool costing_for_scalar) > >>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar); > >>>>>> } > >>>>>> > >>>>>> -/* Return true if the current CPU should use the new costs defined > >>>>>> - in GCC 11. This should be removed for GCC 12 and above, with the > >>>>>> - costs applying to all CPUs instead. */ > >>>>>> -static bool > >>>>>> -aarch64_use_new_vector_costs_p () > >>>>>> -{ > >>>>>> - return (aarch64_tune_params.extra_tuning_flags > >>>>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); > >>>>>> -} > >>>>>> - > >>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ > >>>>>> static const simd_vec_cost * > >>>>>> aarch64_simd_vec_costs (tree vectype) > >>>>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int > >>>>>> count, vect_cost_for_stmt kind, > >>>>>> > >>>>>> /* Do one-time initialization based on the vinfo. */ > >>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); > >>>>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) > >>>>>> + if (!m_analyzed_vinfo) > >>>>>> { > >>>>>> if (loop_vinfo) > >>>>>> analyze_loop_vinfo (loop_vinfo); > >>>>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int > >>>>>> count, vect_cost_for_stmt kind, > >>>>>> > >>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead > >>>>>> of just looking at KIND. 
*/ > >>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) > >>>>>> + if (stmt_info) > >>>>>> { > >>>>>> /* If we scalarize a strided store, the vectorizer costs one > >>>>>> vec_to_scalar for each element. However, we can store the first > >>>>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int > >>>>>> count, vect_cost_for_stmt kind, > >>>>>> else > >>>>>> m_num_last_promote_demote = 0; > >>>>>> > >>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) > >>>>>> + if (stmt_info) > >>>>>> { > >>>>>> /* Account for any extra "embedded" costs that apply additively > >>>>>> to the base cost calculated above. */ > >>>>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const > >>>>>> vector_costs *uncast_scalar_costs) > >>>>>> > >>>>>> auto *scalar_costs > >>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); > >>>>>> - if (loop_vinfo > >>>>>> - && m_vec_flags > >>>>>> - && aarch64_use_new_vector_costs_p ()) > >>>>>> + if (loop_vinfo && m_vec_flags) > >>>>>> { > >>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, > >>>>>> m_costs[vect_body]); > >>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h > >>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h > >>>>>> index b2ff716157a..0a8eff69307 100644 > >>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h > >>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h > >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings > >>>>>> = > >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>> (AARCH64_EXTRA_TUNE_BASE > >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>>> &generic_prefetch_tune, > >>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>>>> index 2d704ecd110..a564528f43d 100644 > >>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>>>> @@ -55,7 +55,6 @@ static const struct tune_params > >>>>>> fujitsu_monaka_tunings = > >>>>>> 0, /* max_case_values. */ > >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>> (AARCH64_EXTRA_TUNE_BASE > >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>>>> &generic_prefetch_tune, > >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>>>> index bdd309ab03d..f090d5cde50 100644 > >>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>>>> @@ -183,7 +183,6 @@ static const struct tune_params > >>>>>> generic_armv8_a_tunings = > >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>> (AARCH64_EXTRA_TUNE_BASE > >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>>>> &generic_prefetch_tune, > >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. 
*/ > >>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>>>> index a05a9ab92a2..4c33c147444 100644 > >>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>>>> @@ -249,7 +249,6 @@ static const struct tune_params > >>>>>> generic_armv9_a_tunings = > >>>>>> 0, /* max_case_values. */ > >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>> (AARCH64_EXTRA_TUNE_BASE > >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>>>> &generic_armv9a_prefetch_tune, > >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>>>> index c407b89a22f..fe4f7c10f73 100644 > >>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>>>> @@ -156,7 +156,6 @@ static const struct tune_params > >>>>>> neoverse512tvb_tunings = > >>>>>> 0, /* max_case_values. */ > >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>>>> &generic_prefetch_tune, > >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h > >>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h > >>>>>> index fd5f8f37370..0c74068da2c 100644 > >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h > >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h > >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings > >>>>>> = > >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>> (AARCH64_EXTRA_TUNE_BASE > >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>>> &generic_prefetch_tune, > >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h > >>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h > >>>>>> index 8b156c2fe4d..9d4e1be171a 100644 > >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h > >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h > >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings > >>>>>> = > >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>> (AARCH64_EXTRA_TUNE_BASE > >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>>>> &generic_prefetch_tune, > >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h > >>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h > >>>>>> index 23c121d8652..85a78bb2bef 100644 > >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h > >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h > >>>>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings > >>>>>> = > >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ > >>>>>> (AARCH64_EXTRA_TUNE_BASE > >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>>> &generic_prefetch_tune, > >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h > >>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h > >>>>>> index 40af5f47f4f..1dd452beb8d 100644 > >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h > >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h > >>>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings > >>>>>> = > >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>> (AARCH64_EXTRA_TUNE_BASE > >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW > >>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ > >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h > >>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h > >>>>>> index d65d74bfecf..d0ba5b1aef6 100644 > >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h > >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h > >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings > >>>>>> = > >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>> (AARCH64_EXTRA_TUNE_BASE > >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>>> &generic_prefetch_tune, > >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>>>> index 7b7fa0b4b08..a1572048503 100644 > >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params > >>>>>> neoversev3ae_tunings = > >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>> (AARCH64_EXTRA_TUNE_BASE > >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>>> &generic_prefetch_tune, > >>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>>>> index 762805ff54b..c334b7a6875 100644 > >>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>>>> @@ -15,4 +15,4 @@ > >>>>>> so we vectorize the offset calculation. This means that the > >>>>>> 64-bit version needs two copies. 
*/ > >>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, > >>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ > >>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, > >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ > >>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, > >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ > >>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>>>> index f0ea58e38e2..94cc63049bc 100644 > >>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>>>> @@ -15,4 +15,4 @@ > >>>>>> so we vectorize the offset calculation. This means that the > >>>>>> 64-bit version needs two copies. */ > >>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], > >>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ > >>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], > >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ > >>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], > >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ > >>>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > >>>>>> index be1139a423c..6d7d28c4702 100644 > >>>>>> --- a/gcc/tree-vect-stmts.cc > >>>>>> +++ b/gcc/tree-vect-stmts.cc > >>>>>> @@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo, > >>>>>> { > >>>>>> if (costing_p) > >>>>>> { > >>>>>> - /* Only need vector extracting when there are > >>>>>> more > >>>>>> - than one stores. */ > >>>>>> - if (nstores > 1) > >>>>>> - inside_cost > >>>>>> - += record_stmt_cost (cost_vec, 1, > >>>>>> vec_to_scalar, > >>>>>> - stmt_info, slp_node, > >>>>>> - 0, vect_body); > >>>>>> /* Take a single lane vector type store as scalar > >>>>>> store to avoid ICE like 110776. */ > >>>>>> - if (VECTOR_TYPE_P (ltype) > >>>>>> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), > >>>>>> 1U)) > >>>>>> + bool single_lane_vec_p = > >>>>>> + VECTOR_TYPE_P (ltype) > >>>>>> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U); > >>>>>> + /* Only need vector extracting when there are > >>>>>> more > >>>>>> + than one stores. */ > >>>>>> + if (nstores > 1 || single_lane_vec_p) > >>>>>> n_adjacent_stores++; > >>>>>> - else > >>>>>> + if (!single_lane_vec_p) > >>>>> > >>>>> I think it's somewhat non-obvious that nstores > 1 and single_lane_vec_p > >>>>> correlate. In fact I think that we always record a store, just for > >>>>> single-element > >>>>> vectors we record scalar stores. I suggest to here always to just > >>>>> n_adjacent_stores++ > >>>>> and below ... > >>>>> > >>>>>> inside_cost > >>>>>> += record_stmt_cost (cost_vec, 1, scalar_store, > >>>>>> stmt_info, 0, vect_body); > >>>>>> @@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo, > >>>>>> if (costing_p) > >>>>>> { > >>>>>> if (n_adjacent_stores > 0) > >>>>>> - vect_get_store_cost (vinfo, stmt_info, slp_node, > >>>>>> n_adjacent_stores, > >>>>>> - alignment_support_scheme, > >>>>>> misalignment, > >>>>>> - &inside_cost, cost_vec); > >>>>>> + { > >>>>>> + vect_get_store_cost (vinfo, stmt_info, slp_node, > >>>>>> n_adjacent_stores, > >>>>>> + alignment_support_scheme, > >>>>>> misalignment, > >>>>>> + &inside_cost, cost_vec); > >>>>> > >>>>> ... record n_adjacent_stores scalar_store when ltype is single-lane and > >>>>> record > >>>>> n_adjacent_stores vect_to_scalar if nstores > 1 (and else none). > >>>>> > >>>>> Richard. 
> >>>> Thanks for the feedback, I’m glad it’s going in the right direction. > >>>> Below is the updated patch, re-validated on aarch64. > >>>> Thanks, Jennifer > >>>> > >>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable > >>>> and > >>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the > >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the > >>>> default. To that end, the function aarch64_use_new_vector_costs_p and > >>>> its uses > >>>> were removed. To prevent costing vec_to_scalar operations with 0, as > >>>> described in > >>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, > >>>> we adjusted vectorizable_store such that the variable n_adjacent_stores > >>>> also covers vec_to_scalar operations. This way vec_to_scalar operations > >>>> are not costed individually, but as a group. > >>>> > >>>> Two tests were adjusted due to changes in codegen. In both cases, the > >>>> old code performed loop unrolling once, but the new code does not: > >>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with > >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic > >>>> -moverride=tune=none): > >>>> f_int64_t_32: > >>>> cbz w3, .L92 > >>>> mov x4, 0 > >>>> uxtw x3, w3 > >>>> + cntd x5 > >>>> + whilelo p7.d, xzr, x3 > >>>> + mov z29.s, w5 > >>>> mov z31.s, w2 > >>>> - whilelo p6.d, xzr, x3 > >>>> - mov x2, x3 > >>>> - index z30.s, #0, #1 > >>>> - uqdecd x2 > >>>> - ptrue p5.b, all > >>>> - whilelo p7.d, xzr, x2 > >>>> + index z30.d, #0, #1 > >>>> + ptrue p6.b, all > >>>> .p2align 3,,7 > >>>> .L94: > >>>> - ld1d z27.d, p7/z, [x0, #1, mul vl] > >>>> - ld1d z28.d, p6/z, [x0] > >>>> - movprfx z29, z31 > >>>> - mul z29.s, p5/m, z29.s, z30.s > >>>> - incw x4 > >>>> - uunpklo z0.d, z29.s > >>>> - uunpkhi z29.d, z29.s > >>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] > >>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] > >>>> - add z25.d, z28.d, z25.d > >>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3] > >>>> + movprfx z28, z31 > >>>> + mul z28.s, p6/m, z28.s, z30.s > >>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] > >>>> add z26.d, z27.d, z26.d > >>>> - st1d z26.d, p7, [x0, #1, mul vl] > >>>> - whilelo p7.d, x4, x2 > >>>> - st1d z25.d, p6, [x0] > >>>> - incw z30.s > >>>> - incb x0, all, mul #2 > >>>> - whilelo p6.d, x4, x3 > >>>> + st1d z26.d, p7, [x0, x4, lsl 3] > >>>> + add z30.s, z30.s, z29.s > >>>> + incd x4 > >>>> + whilelo p7.d, x4, x3 > >>>> b.any .L94 > >>>> .L92: > >>>> ret > >>>> > >>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with > >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic > >>>> -moverride=tune=none): > >>>> f_int64_t_32: > >>>> cbz w3, .L84 > >>>> - addvl x5, x1, #1 > >>>> mov x4, 0 > >>>> uxtw x3, w3 > >>>> - mov z31.s, w2 > >>>> + cntd x5 > >>>> whilelo p7.d, xzr, x3 > >>>> - mov x2, x3 > >>>> - index z30.s, #0, #1 > >>>> - uqdecd x2 > >>>> - ptrue p5.b, all > >>>> - whilelo p6.d, xzr, x2 > >>>> + mov z29.s, w5 > >>>> + mov z31.s, w2 > >>>> + index z30.d, #0, #1 > >>>> + ptrue p6.b, all > >>>> .p2align 3,,7 > >>>> .L86: > >>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3] > >>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3] > >>>> - movprfx z29, z30 > >>>> - mul z29.s, p5/m, z29.s, z31.s > >>>> - add z28.d, z28.d, #1 > >>>> - uunpklo z26.d, z29.s > >>>> - st1d z28.d, p7, [x0, z26.d, lsl 3] > >>>> - incw x4 > >>>> - uunpkhi z29.d, z29.s > >>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3] > >>>> + movprfx z28, z30 > >>>> + mul z28.s, p6/m, z28.s, z31.s > >>>> add z27.d, z27.d, #1 > >>>> - whilelo 
p6.d, x4, x2 > >>>> - st1d z27.d, p7, [x0, z29.d, lsl 3] > >>>> - incw z30.s > >>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3] > >>>> + incd x4 > >>>> + add z30.s, z30.s, z29.s > >>>> whilelo p7.d, x4, x3 > >>>> b.any .L86 > >>>> .L84: > >>>> ret > >>>> > >>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no > >>>> regression. > >>>> OK for mainline? > >>>> > >>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> > >>>> > >>>> gcc/ > >>>> * tree-vect-stmts.cc (vectorizable_store): Extend the use of > >>>> n_adjacent_stores to also cover vec_to_scalar operations. > >>>> * config/aarch64/aarch64-tuning-flags.def: Remove > >>>> use_new_vector_costs as tuning option. > >>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): > >>>> Remove. > >>>> (aarch64_vector_costs::add_stmt_cost): Remove use of > >>>> aarch64_use_new_vector_costs_p. > >>>> (aarch64_vector_costs::finish_cost): Remove use of > >>>> aarch64_use_new_vector_costs_p. > >>>> * config/aarch64/tuning_models/cortexx925.h: Remove > >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. > >>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. > >>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. > >>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. > >>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. > >>>> * config/aarch64/tuning_models/neoversen2.h: Likewise. > >>>> * config/aarch64/tuning_models/neoversen3.h: Likewise. > >>>> * config/aarch64/tuning_models/neoversev1.h: Likewise. > >>>> * config/aarch64/tuning_models/neoversev2.h: Likewise. > >>>> * config/aarch64/tuning_models/neoversev3.h: Likewise. > >>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. > >>>> > >>>> gcc/testsuite/ > >>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. > >>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. 
> >>>> --- > >>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 - > >>>> gcc/config/aarch64/aarch64.cc | 20 ++-------- > >>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - > >>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - > >>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 - > >>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 - > >>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 - > >>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - > >>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - > >>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - > >>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - > >>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - > >>>> .../aarch64/tuning_models/neoversev3ae.h | 1 - > >>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- > >>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- > >>>> gcc/tree-vect-stmts.cc | 37 +++++++++++-------- > >>>> 16 files changed, 27 insertions(+), 47 deletions(-) > >>>> > >>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def > >>>> b/gcc/config/aarch64/aarch64-tuning-flags.def > >>>> index ffbff20e29c..1de633c739b 100644 > >>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def > >>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def > >>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", > >>>> CHEAP_SHIFT_EXTEND) > >>>> > >>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", > >>>> CSE_SVE_VL_CONSTANTS) > >>>> > >>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", > >>>> USE_NEW_VECTOR_COSTS) > >>>> - > >>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", > >>>> MATCHED_VECTOR_THROUGHPUT) > >>>> > >>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", > >>>> AVOID_CROSS_LOOP_FMA) > >>>> diff --git a/gcc/config/aarch64/aarch64.cc > >>>> b/gcc/config/aarch64/aarch64.cc > >>>> index 77a2a6bfa3a..71fba9cc63b 100644 > >>>> --- a/gcc/config/aarch64/aarch64.cc > >>>> +++ b/gcc/config/aarch64/aarch64.cc > >>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info > >>>> *vinfo, bool costing_for_scalar) > >>>> return new aarch64_vector_costs (vinfo, costing_for_scalar); > >>>> } > >>>> > >>>> -/* Return true if the current CPU should use the new costs defined > >>>> - in GCC 11. This should be removed for GCC 12 and above, with the > >>>> - costs applying to all CPUs instead. */ > >>>> -static bool > >>>> -aarch64_use_new_vector_costs_p () > >>>> -{ > >>>> - return (aarch64_tune_params.extra_tuning_flags > >>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); > >>>> -} > >>>> - > >>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ > >>>> static const simd_vec_cost * > >>>> aarch64_simd_vec_costs (tree vectype) > >>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, > >>>> vect_cost_for_stmt kind, > >>>> > >>>> /* Do one-time initialization based on the vinfo. */ > >>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); > >>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) > >>>> + if (!m_analyzed_vinfo) > >>>> { > >>>> if (loop_vinfo) > >>>> analyze_loop_vinfo (loop_vinfo); > >>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, > >>>> vect_cost_for_stmt kind, > >>>> > >>>> /* Try to get a more accurate cost by looking at STMT_INFO instead > >>>> of just looking at KIND. 
*/ > >>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) > >>>> + if (stmt_info) > >>>> { > >>>> /* If we scalarize a strided store, the vectorizer costs one > >>>> vec_to_scalar for each element. However, we can store the first > >>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, > >>>> vect_cost_for_stmt kind, > >>>> else > >>>> m_num_last_promote_demote = 0; > >>>> > >>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) > >>>> + if (stmt_info) > >>>> { > >>>> /* Account for any extra "embedded" costs that apply additively > >>>> to the base cost calculated above. */ > >>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const > >>>> vector_costs *uncast_scalar_costs) > >>>> > >>>> auto *scalar_costs > >>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); > >>>> - if (loop_vinfo > >>>> - && m_vec_flags > >>>> - && aarch64_use_new_vector_costs_p ()) > >>>> + if (loop_vinfo && m_vec_flags) > >>>> { > >>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, > >>>> m_costs[vect_body]); > >>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h > >>>> b/gcc/config/aarch64/tuning_models/cortexx925.h > >>>> index 5ebaf66e986..74772f3e15f 100644 > >>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h > >>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h > >>>> @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings = > >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>> (AARCH64_EXTRA_TUNE_BASE > >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>> &generic_armv9a_prefetch_tune, > >>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>> index 2d704ecd110..a564528f43d 100644 > >>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings > >>>> = > >>>> 0, /* max_case_values. */ > >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>> (AARCH64_EXTRA_TUNE_BASE > >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>> &generic_prefetch_tune, > >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>> index bdd309ab03d..f090d5cde50 100644 > >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>> @@ -183,7 +183,6 @@ static const struct tune_params > >>>> generic_armv8_a_tunings = > >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>> (AARCH64_EXTRA_TUNE_BASE > >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>> &generic_prefetch_tune, > >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. 
*/ > >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>> index 785e00946bc..7b5821183bc 100644 > >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>> @@ -251,7 +251,6 @@ static const struct tune_params > >>>> generic_armv9_a_tunings = > >>>> 0, /* max_case_values. */ > >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>> (AARCH64_EXTRA_TUNE_BASE > >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>> &generic_armv9a_prefetch_tune, > >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>> index 007f987154c..f7457df59e5 100644 > >>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>> @@ -156,7 +156,6 @@ static const struct tune_params > >>>> neoverse512tvb_tunings = > >>>> 0, /* max_case_values. */ > >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>> &generic_armv9a_prefetch_tune, > >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h > >>>> b/gcc/config/aarch64/tuning_models/neoversen2.h > >>>> index 32560d2f5f8..541b61c8179 100644 > >>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h > >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h > >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings = > >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>> (AARCH64_EXTRA_TUNE_BASE > >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>> &generic_armv9a_prefetch_tune, > >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h > >>>> b/gcc/config/aarch64/tuning_models/neoversen3.h > >>>> index 2010bc4645b..eff668132a8 100644 > >>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h > >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h > >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings = > >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>> (AARCH64_EXTRA_TUNE_BASE > >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>> &generic_armv9a_prefetch_tune, > >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h > >>>> b/gcc/config/aarch64/tuning_models/neoversev1.h > >>>> index c3751e32696..d11472b6e1e 100644 > >>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h > >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h > >>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings = > >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ > >>>> (AARCH64_EXTRA_TUNE_BASE > >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>> &generic_armv9a_prefetch_tune, > >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h > >>>> b/gcc/config/aarch64/tuning_models/neoversev2.h > >>>> index 80dbe5c806c..ee77ffdd3bc 100644 > >>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h > >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h > >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings = > >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>> (AARCH64_EXTRA_TUNE_BASE > >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW > >>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ > >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h > >>>> b/gcc/config/aarch64/tuning_models/neoversev3.h > >>>> index efe09e16d1e..6ef143ef7d5 100644 > >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h > >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h > >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings = > >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>> (AARCH64_EXTRA_TUNE_BASE > >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>> &generic_armv9a_prefetch_tune, > >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>> index 66849f30889..96bdbf971f1 100644 > >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings > >>>> = > >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>> (AARCH64_EXTRA_TUNE_BASE > >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>> &generic_armv9a_prefetch_tune, > >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>> index 762805ff54b..c334b7a6875 100644 > >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>> @@ -15,4 +15,4 @@ > >>>> so we vectorize the offset calculation. This means that the > >>>> 64-bit version needs two copies. 
*/ > >>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, > >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ > >>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, > >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ > >>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, > >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ > >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>> index f0ea58e38e2..94cc63049bc 100644 > >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>> @@ -15,4 +15,4 @@ > >>>> so we vectorize the offset calculation. This means that the > >>>> 64-bit version needs two copies. */ > >>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], > >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ > >>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], > >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ > >>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], > >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ > >>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > >>>> index be1139a423c..ab57163c243 100644 > >>>> --- a/gcc/tree-vect-stmts.cc > >>>> +++ b/gcc/tree-vect-stmts.cc > >>>> @@ -8834,19 +8834,8 @@ vectorizable_store (vec_info *vinfo, > >>>> { > >>>> if (costing_p) > >>>> { > >>>> - /* Only need vector extracting when there are more > >>>> - than one stores. */ > >>>> - if (nstores > 1) > >>>> - inside_cost > >>>> - += record_stmt_cost (cost_vec, 1, vec_to_scalar, > >>>> - stmt_info, slp_node, > >>>> - 0, vect_body); > >>>> - /* Take a single lane vector type store as scalar > >>>> - store to avoid ICE like 110776. */ > >>>> - if (VECTOR_TYPE_P (ltype) > >>>> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) > >>>> - n_adjacent_stores++; > >>>> - else > >>>> + n_adjacent_stores++; > >>>> + if (!VECTOR_TYPE_P (ltype)) > >>> > >>> This should be combined with the single-lane vector case below > >>> > >>>> inside_cost > >>>> += record_stmt_cost (cost_vec, 1, scalar_store, > >>>> stmt_info, 0, vect_body); > >>>> @@ -8905,9 +8894,25 @@ vectorizable_store (vec_info *vinfo, > >>>> if (costing_p) > >>>> { > >>>> if (n_adjacent_stores > 0) > >>>> - vect_get_store_cost (vinfo, stmt_info, slp_node, > >>>> n_adjacent_stores, > >>>> - alignment_support_scheme, misalignment, > >>>> - &inside_cost, cost_vec); > >>>> + { > >>>> + /* Take a single lane vector type store as scalar > >>>> + store to avoid ICE like 110776. */ > >>>> + if (VECTOR_TYPE_P (ltype) > >>>> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) > >>>> + inside_cost > >>>> + += record_stmt_cost (cost_vec, n_adjacent_stores, > >>>> + scalar_store, stmt_info, 0, vect_body); > >>>> + /* Only need vector extracting when there are more > >>>> + than one stores. */ > >>>> + if (nstores > 1) > >>>> + inside_cost > >>>> + += record_stmt_cost (cost_vec, n_adjacent_stores, > >>>> + vec_to_scalar, stmt_info, slp_node, > >>>> + 0, vect_body); > >>>> + vect_get_store_cost (vinfo, stmt_info, slp_node, > >>> > >>> This should be only done for multi-lane vectors > >> Thanks for the quick reply. As I am making the changes, I am wondering: Do > >> we even need n_adjacent_stores anymore? It appears to always have the same > >> value as nstores. Can we remove it and use nstores instead or does it > >> still serve another purpose?
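Taken together, the two review comments above point at a costing block of roughly the following shape. This is only a sketch of the suggested direction, reusing the identifiers and the record_stmt_cost/vect_get_store_cost call forms from the hunks quoted above; it is not the actual committed change:

    /* Inside the per-element loop: just count the pieces.  */
    if (costing_p)
      {
        n_adjacent_stores++;
        continue;
      }

    /* After the loop: cost the whole group at once.  */
    if (costing_p && n_adjacent_stores > 0)
      {
        /* Take a single lane vector type store as scalar store (PR 110776);
           multi-lane pieces go through the normal vector store costing.  */
        if (VECTOR_TYPE_P (ltype)
            && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
          vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
                               alignment_support_scheme, misalignment,
                               &inside_cost, cost_vec);
        else
          inside_cost
            += record_stmt_cost (cost_vec, n_adjacent_stores, scalar_store,
                                 stmt_info, 0, vect_body);
        /* Lane extracts are costed as one group, and only needed when a
           vector feeds more than one store.  */
        if (nstores > 1)
          inside_cost
            += record_stmt_cost (cost_vec, n_adjacent_stores, vec_to_scalar,
                                 stmt_info, slp_node, 0, vect_body);
      }

This way the backend sees the extracts as a single call with count = n_adjacent_stores, so the lane-0 adjustment in aarch64's add_stmt_cost subtracts exactly one extract instead of zeroing each of them.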
> > > It was a heuristic needed for powerpc(?), can you confirm we’re not > > combining stores from VF unrolling for strided SLP stores? > Hi Richard, > the reasoning behind my suggestion to replace n_adjacent_stores by nstores in > this code section is that with my patch they will logically always have the > same value. > > Having said that, I looked into why n_adjacent_stores was introduced in the > first place: The patch [1] that introduced n_adjacent_stores fixed a > regression on aarch64 by costing vector loads/stores together. The variables > n_adjacent_stores and n_adjacent_loads were added in two code sections each > in vectorizable_store and vectorizable_load. The connection to PowerPC you > recalled is also mentioned in the PR, but I believe it refers to the enum > dr_alignment_support alignment_support_scheme that is used in > > vect_get_store_cost (vinfo, stmt_info, slp_node, > n_adjacent_stores, alignment_support_scheme, > misalignment, &inside_cost, cost_vec); > > to which I made no changes other than refactoring the if-statement around it. > > So, taking the fact that n_adjacent_stores has been introduced in multiple > locations into account I would actually leave n_adjacent_stores in the code > section that I made changes to in order to keep vectorizable_store and > vectorizable_load consistent. > > Regarding your question about not combining stores from loop unrolling for > strided SLP stores: I'm not entirely sure what you mean, but could it be > covered by the tests gcc.target/aarch64/ldp_stp_* that were also mentioned in > [1]?
I'm refering to a case with variable stride for (.. i += s) { a[4*i] = ..; a[4*i + 1] = ...; a[4*i + 2] = ...; a[4*i + 3] = ...; } where we might choose to store to the V4SI destination using two V2SI stores (adjacent), iff the VF ends up equal two we'd have two sets of a[] stores, thus four V2SI stores but only two of them would be "adjacent". Note I don't know whether "adjacent" really was supposed to be adjacent or rather "related". Anyway, the costing interface for loads and stores is likely to change sustantially for GCC 16. > I added the changes you proposed in the updated patch below, but kept > n_adjacent_stores. The patch was re-validated on aarch64. > Thanks, > Jennifer > > [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111784#c3 > > > This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and > use_new_vector_costs entry in aarch64-tuning-flags.def and makes the > AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the > default. To that end, the function aarch64_use_new_vector_costs_p and its uses > were removed. To prevent costing vec_to_scalar operations with 0, as > described in > https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, > we adjusted vectorizable_store such that the variable n_adjacent_stores > also covers vec_to_scalar operations. This way vec_to_scalar operations > are not costed individually, but as a group. > > Two tests were adjusted due to changes in codegen. In both cases, the > old code performed loop unrolling once, but the new code does not: > Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with > -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic > -moverride=tune=none): > f_int64_t_32: > cbz w3, .L92 > mov x4, 0 > uxtw x3, w3 > + cntd x5 > + whilelo p7.d, xzr, x3 > + mov z29.s, w5 > mov z31.s, w2 > - whilelo p6.d, xzr, x3 > - mov x2, x3 > - index z30.s, #0, #1 > - uqdecd x2 > - ptrue p5.b, all > - whilelo p7.d, xzr, x2 > + index z30.d, #0, #1 > + ptrue p6.b, all > .p2align 3,,7 > .L94: > - ld1d z27.d, p7/z, [x0, #1, mul vl] > - ld1d z28.d, p6/z, [x0] > - movprfx z29, z31 > - mul z29.s, p5/m, z29.s, z30.s > - incw x4 > - uunpklo z0.d, z29.s > - uunpkhi z29.d, z29.s > - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] > - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] > - add z25.d, z28.d, z25.d > + ld1d z27.d, p7/z, [x0, x4, lsl 3] > + movprfx z28, z31 > + mul z28.s, p6/m, z28.s, z30.s > + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] > add z26.d, z27.d, z26.d > - st1d z26.d, p7, [x0, #1, mul vl] > - whilelo p7.d, x4, x2 > - st1d z25.d, p6, [x0] > - incw z30.s > - incb x0, all, mul #2 > - whilelo p6.d, x4, x3 > + st1d z26.d, p7, [x0, x4, lsl 3] > + add z30.s, z30.s, z29.s > + incd x4 > + whilelo p7.d, x4, x3 > b.any .L94 > .L92: > ret > > Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with > -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic > -moverride=tune=none): > f_int64_t_32: > cbz w3, .L84 > - addvl x5, x1, #1 > mov x4, 0 > uxtw x3, w3 > - mov z31.s, w2 > + cntd x5 > whilelo p7.d, xzr, x3 > - mov x2, x3 > - index z30.s, #0, #1 > - uqdecd x2 > - ptrue p5.b, all > - whilelo p6.d, xzr, x2 > + mov z29.s, w5 > + mov z31.s, w2 > + index z30.d, #0, #1 > + ptrue p6.b, all > .p2align 3,,7 > .L86: > - ld1d z28.d, p7/z, [x1, x4, lsl 3] > - ld1d z27.d, p6/z, [x5, x4, lsl 3] > - movprfx z29, z30 > - mul z29.s, p5/m, z29.s, z31.s > - add z28.d, z28.d, #1 > - uunpklo z26.d, z29.s > - st1d z28.d, p7, [x0, z26.d, lsl 3] > - incw x4 > - uunpkhi z29.d, z29.s > + ld1d z27.d, p7/z, [x1, x4, lsl 3] > + movprfx 
z28, z30 > + mul z28.s, p6/m, z28.s, z31.s > add z27.d, z27.d, #1 > - whilelo p6.d, x4, x2 > - st1d z27.d, p7, [x0, z29.d, lsl 3] > - incw z30.s > + st1d z27.d, p7, [x0, z28.d, uxtw 3] > + incd x4 > + add z30.s, z30.s, z29.s > whilelo p7.d, x4, x3 > b.any .L86 > .L84: > ret > > The patch was bootstrapped and tested on aarch64-linux-gnu, no > regression. > OK for mainline? LGTM. Richard. > Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> > > gcc/ > * tree-vect-stmts.cc (vectorizable_store): Extend the use of > n_adjacent_stores to also cover vec_to_scalar operations. > * config/aarch64/aarch64-tuning-flags.def: Remove > use_new_vector_costs as tuning option. > * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): > Remove. > (aarch64_vector_costs::add_stmt_cost): Remove use of > aarch64_use_new_vector_costs_p. > (aarch64_vector_costs::finish_cost): Remove use of > aarch64_use_new_vector_costs_p. > * config/aarch64/tuning_models/cortexx925.h: Remove > AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. > * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. > * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. > * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. > * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. > * config/aarch64/tuning_models/neoversen2.h: Likewise. > * config/aarch64/tuning_models/neoversen3.h: Likewise. > * config/aarch64/tuning_models/neoversev1.h: Likewise. > * config/aarch64/tuning_models/neoversev2.h: Likewise. > * config/aarch64/tuning_models/neoversev3.h: Likewise. > * config/aarch64/tuning_models/neoversev3ae.h: Likewise. > > gcc/testsuite/ > * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. > * gcc.target/aarch64/sve/strided_store_2.c: Likewise. > --- > gcc/config/aarch64/aarch64-tuning-flags.def | 2 - > gcc/config/aarch64/aarch64.cc | 20 ++-------- > gcc/config/aarch64/tuning_models/cortexx925.h | 1 - > .../aarch64/tuning_models/fujitsu_monaka.h | 1 - > .../aarch64/tuning_models/generic_armv8_a.h | 1 - > .../aarch64/tuning_models/generic_armv9_a.h | 1 - > .../aarch64/tuning_models/neoverse512tvb.h | 1 - > gcc/config/aarch64/tuning_models/neoversen2.h | 1 - > gcc/config/aarch64/tuning_models/neoversen3.h | 1 - > gcc/config/aarch64/tuning_models/neoversev1.h | 1 - > gcc/config/aarch64/tuning_models/neoversev2.h | 1 - > gcc/config/aarch64/tuning_models/neoversev3.h | 1 - > .../aarch64/tuning_models/neoversev3ae.h | 1 - > .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- > .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- > gcc/tree-vect-stmts.cc | 40 ++++++++++--------- > 16 files changed, 27 insertions(+), 50 deletions(-) > > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def > b/gcc/config/aarch64/aarch64-tuning-flags.def > index ffbff20e29c..1de633c739b 100644 > --- a/gcc/config/aarch64/aarch64-tuning-flags.def > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def > @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", > CHEAP_SHIFT_EXTEND) > > AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS) > > -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS) > - > AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", > MATCHED_VECTOR_THROUGHPUT) > > AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA) > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index 77a2a6bfa3a..71fba9cc63b 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -16627,16 +16627,6 @@ 
aarch64_vectorize_create_costs (vec_info *vinfo, > bool costing_for_scalar) > return new aarch64_vector_costs (vinfo, costing_for_scalar); > } > > -/* Return true if the current CPU should use the new costs defined > - in GCC 11. This should be removed for GCC 12 and above, with the > - costs applying to all CPUs instead. */ > -static bool > -aarch64_use_new_vector_costs_p () > -{ > - return (aarch64_tune_params.extra_tuning_flags > - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); > -} > - > /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ > static const simd_vec_cost * > aarch64_simd_vec_costs (tree vectype) > @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, > vect_cost_for_stmt kind, > > /* Do one-time initialization based on the vinfo. */ > loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); > - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) > + if (!m_analyzed_vinfo) > { > if (loop_vinfo) > analyze_loop_vinfo (loop_vinfo); > @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, > vect_cost_for_stmt kind, > > /* Try to get a more accurate cost by looking at STMT_INFO instead > of just looking at KIND. */ > - if (stmt_info && aarch64_use_new_vector_costs_p ()) > + if (stmt_info) > { > /* If we scalarize a strided store, the vectorizer costs one > vec_to_scalar for each element. However, we can store the first > @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, > vect_cost_for_stmt kind, > else > m_num_last_promote_demote = 0; > > - if (stmt_info && aarch64_use_new_vector_costs_p ()) > + if (stmt_info) > { > /* Account for any extra "embedded" costs that apply additively > to the base cost calculated above. */ > @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const vector_costs > *uncast_scalar_costs) > > auto *scalar_costs > = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); > - if (loop_vinfo > - && m_vec_flags > - && aarch64_use_new_vector_costs_p ()) > + if (loop_vinfo && m_vec_flags) > { > m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, > m_costs[vect_body]); > diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h > b/gcc/config/aarch64/tuning_models/cortexx925.h > index 5ebaf66e986..74772f3e15f 100644 > --- a/gcc/config/aarch64/tuning_models/cortexx925.h > +++ b/gcc/config/aarch64/tuning_models/cortexx925.h > @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > &generic_armv9a_prefetch_tune, > diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > index 2d704ecd110..a564528f43d 100644 > --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings = > 0, /* max_case_values. */ > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > &generic_prefetch_tune, > AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. 
*/ > diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > index bdd309ab03d..f090d5cde50 100644 > --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > @@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > &generic_prefetch_tune, > AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > index 785e00946bc..7b5821183bc 100644 > --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > @@ -251,7 +251,6 @@ static const struct tune_params generic_armv9_a_tunings = > 0, /* max_case_values. */ > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > &generic_armv9a_prefetch_tune, > AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > index 007f987154c..f7457df59e5 100644 > --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings = > 0, /* max_case_values. */ > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > &generic_armv9a_prefetch_tune, > AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h > b/gcc/config/aarch64/tuning_models/neoversen2.h > index 32560d2f5f8..541b61c8179 100644 > --- a/gcc/config/aarch64/tuning_models/neoversen2.h > +++ b/gcc/config/aarch64/tuning_models/neoversen2.h > @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > &generic_armv9a_prefetch_tune, > diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h > b/gcc/config/aarch64/tuning_models/neoversen3.h > index 2010bc4645b..eff668132a8 100644 > --- a/gcc/config/aarch64/tuning_models/neoversen3.h > +++ b/gcc/config/aarch64/tuning_models/neoversen3.h > @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > &generic_armv9a_prefetch_tune, > AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. 
*/ > diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h > b/gcc/config/aarch64/tuning_models/neoversev1.h > index c3751e32696..d11472b6e1e 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev1.h > +++ b/gcc/config/aarch64/tuning_models/neoversev1.h > @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > &generic_armv9a_prefetch_tune, > diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h > b/gcc/config/aarch64/tuning_models/neoversev2.h > index 80dbe5c806c..ee77ffdd3bc 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev2.h > +++ b/gcc/config/aarch64/tuning_models/neoversev2.h > @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW > | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ > diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h > b/gcc/config/aarch64/tuning_models/neoversev3.h > index efe09e16d1e..6ef143ef7d5 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev3.h > +++ b/gcc/config/aarch64/tuning_models/neoversev3.h > @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > &generic_armv9a_prefetch_tune, > diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h > b/gcc/config/aarch64/tuning_models/neoversev3ae.h > index 66849f30889..96bdbf971f1 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h > +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h > @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > &generic_armv9a_prefetch_tune, > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > index 762805ff54b..c334b7a6875 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > @@ -15,4 +15,4 @@ > so we vectorize the offset calculation. This means that the > 64-bit version needs two copies. 
*/ > /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, > \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ > -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, > \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ > +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, > \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > index f0ea58e38e2..94cc63049bc 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > @@ -15,4 +15,4 @@ > so we vectorize the offset calculation. This means that the > 64-bit version needs two copies. */ > /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+, > z[0-9]+.s, uxtw 2\]\n} 3 } } */ > -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, > z[0-9]+.d, lsl 3\]\n} 15 } } */ > +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, > z[0-9]+.d, lsl 3\]\n} 9 } } */ > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > index be1139a423c..a14248193ca 100644 > --- a/gcc/tree-vect-stmts.cc > +++ b/gcc/tree-vect-stmts.cc > @@ -8834,22 +8834,7 @@ vectorizable_store (vec_info *vinfo, > { > if (costing_p) > { > - /* Only need vector extracting when there are more > - than one stores. */ > - if (nstores > 1) > - inside_cost > - += record_stmt_cost (cost_vec, 1, vec_to_scalar, > - stmt_info, slp_node, > - 0, vect_body); > - /* Take a single lane vector type store as scalar > - store to avoid ICE like 110776. */ > - if (VECTOR_TYPE_P (ltype) > - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) > - n_adjacent_stores++; > - else > - inside_cost > - += record_stmt_cost (cost_vec, 1, scalar_store, > - stmt_info, 0, vect_body); > + n_adjacent_stores++; > continue; > } > tree newref, newoff; > @@ -8905,9 +8890,26 @@ vectorizable_store (vec_info *vinfo, > if (costing_p) > { > if (n_adjacent_stores > 0) > - vect_get_store_cost (vinfo, stmt_info, slp_node, > n_adjacent_stores, > - alignment_support_scheme, misalignment, > - &inside_cost, cost_vec); > + { > + /* Take a single lane vector type store as scalar > + store to avoid ICE like 110776. */ > + if (VECTOR_TYPE_P (ltype) > + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) > + vect_get_store_cost (vinfo, stmt_info, slp_node, > + n_adjacent_stores, > alignment_support_scheme, > + misalignment, &inside_cost, cost_vec); > + else > + inside_cost > + += record_stmt_cost (cost_vec, n_adjacent_stores, > + scalar_store, stmt_info, 0, vect_body); > + /* Only need vector extracting when there are more > + than one stores. 
*/ > + if (nstores > 1) > + inside_cost > + += record_stmt_cost (cost_vec, n_adjacent_stores, > + vec_to_scalar, stmt_info, slp_node, > + 0, vect_body); > + } > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > "vect_model_store_cost: inside_cost = %d, " > -- > 2.44.0 > > > >> Thanks, Jennifer > >>> > >>>> + n_adjacent_stores, alignment_support_scheme, > >>>> + misalignment, &inside_cost, cost_vec); > >>>> + } > >>>> if (dump_enabled_p ()) > >>>> dump_printf_loc (MSG_NOTE, vect_location, > >>>> "vect_model_store_cost: inside_cost = %d, " > >>>> -- > >>>> 2.34.1 > >>>>> > >>>>>> + inside_cost > >>>>>> + += record_stmt_cost (cost_vec, n_adjacent_stores, > >>>>>> vec_to_scalar, > >>>>>> + stmt_info, slp_node, > >>>>>> + 0, vect_body); > >>>>>> + } > >>>>>> if (dump_enabled_p ()) > >>>>>> dump_printf_loc (MSG_NOTE, vect_location, > >>>>>> "vect_model_store_cost: inside_cost = %d, " > >>>>>> -- > >>>>>> 2.44.0 > >>>>>> > >>>>>> > >>>>>>>> > >>>>>>>> Richard > >>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> Jennifer > >>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> Jennifer > >>>>>>>>>>> > >>>>>>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>>>>>> tunable and > >>>>>>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes > >>>>>>>>>>> the > >>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the > >>>>>>>>>>> default. To that end, the function aarch64_use_new_vector_costs_p > >>>>>>>>>>> and its uses > >>>>>>>>>>> were removed. To prevent costing vec_to_scalar operations with 0, > >>>>>>>>>>> as > >>>>>>>>>>> described in > >>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, > >>>>>>>>>>> we guarded the call to vect_is_store_elt_extraction in > >>>>>>>>>>> aarch64_vector_costs::add_stmt_cost by count > 1. > >>>>>>>>>>> > >>>>>>>>>>> Two tests were adjusted due to changes in codegen. 
In both cases, > >>>>>>>>>>> the > >>>>>>>>>>> old code performed loop unrolling once, but the new code does not: > >>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled > >>>>>>>>>>> with > >>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic > >>>>>>>>>>> -moverride=tune=none): > >>>>>>>>>>> f_int64_t_32: > >>>>>>>>>>> cbz w3, .L92 > >>>>>>>>>>> mov x4, 0 > >>>>>>>>>>> uxtw x3, w3 > >>>>>>>>>>> + cntd x5 > >>>>>>>>>>> + whilelo p7.d, xzr, x3 > >>>>>>>>>>> + mov z29.s, w5 > >>>>>>>>>>> mov z31.s, w2 > >>>>>>>>>>> - whilelo p6.d, xzr, x3 > >>>>>>>>>>> - mov x2, x3 > >>>>>>>>>>> - index z30.s, #0, #1 > >>>>>>>>>>> - uqdecd x2 > >>>>>>>>>>> - ptrue p5.b, all > >>>>>>>>>>> - whilelo p7.d, xzr, x2 > >>>>>>>>>>> + index z30.d, #0, #1 > >>>>>>>>>>> + ptrue p6.b, all > >>>>>>>>>>> .p2align 3,,7 > >>>>>>>>>>> .L94: > >>>>>>>>>>> - ld1d z27.d, p7/z, [x0, #1, mul vl] > >>>>>>>>>>> - ld1d z28.d, p6/z, [x0] > >>>>>>>>>>> - movprfx z29, z31 > >>>>>>>>>>> - mul z29.s, p5/m, z29.s, z30.s > >>>>>>>>>>> - incw x4 > >>>>>>>>>>> - uunpklo z0.d, z29.s > >>>>>>>>>>> - uunpkhi z29.d, z29.s > >>>>>>>>>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] > >>>>>>>>>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] > >>>>>>>>>>> - add z25.d, z28.d, z25.d > >>>>>>>>>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3] > >>>>>>>>>>> + movprfx z28, z31 > >>>>>>>>>>> + mul z28.s, p6/m, z28.s, z30.s > >>>>>>>>>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] > >>>>>>>>>>> add z26.d, z27.d, z26.d > >>>>>>>>>>> - st1d z26.d, p7, [x0, #1, mul vl] > >>>>>>>>>>> - whilelo p7.d, x4, x2 > >>>>>>>>>>> - st1d z25.d, p6, [x0] > >>>>>>>>>>> - incw z30.s > >>>>>>>>>>> - incb x0, all, mul #2 > >>>>>>>>>>> - whilelo p6.d, x4, x3 > >>>>>>>>>>> + st1d z26.d, p7, [x0, x4, lsl 3] > >>>>>>>>>>> + add z30.s, z30.s, z29.s > >>>>>>>>>>> + incd x4 > >>>>>>>>>>> + whilelo p7.d, x4, x3 > >>>>>>>>>>> b.any .L94 > >>>>>>>>>>> .L92: > >>>>>>>>>>> ret > >>>>>>>>>>> > >>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled > >>>>>>>>>>> with > >>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic > >>>>>>>>>>> -moverride=tune=none): > >>>>>>>>>>> f_int64_t_32: > >>>>>>>>>>> cbz w3, .L84 > >>>>>>>>>>> - addvl x5, x1, #1 > >>>>>>>>>>> mov x4, 0 > >>>>>>>>>>> uxtw x3, w3 > >>>>>>>>>>> - mov z31.s, w2 > >>>>>>>>>>> + cntd x5 > >>>>>>>>>>> whilelo p7.d, xzr, x3 > >>>>>>>>>>> - mov x2, x3 > >>>>>>>>>>> - index z30.s, #0, #1 > >>>>>>>>>>> - uqdecd x2 > >>>>>>>>>>> - ptrue p5.b, all > >>>>>>>>>>> - whilelo p6.d, xzr, x2 > >>>>>>>>>>> + mov z29.s, w5 > >>>>>>>>>>> + mov z31.s, w2 > >>>>>>>>>>> + index z30.d, #0, #1 > >>>>>>>>>>> + ptrue p6.b, all > >>>>>>>>>>> .p2align 3,,7 > >>>>>>>>>>> .L86: > >>>>>>>>>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3] > >>>>>>>>>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3] > >>>>>>>>>>> - movprfx z29, z30 > >>>>>>>>>>> - mul z29.s, p5/m, z29.s, z31.s > >>>>>>>>>>> - add z28.d, z28.d, #1 > >>>>>>>>>>> - uunpklo z26.d, z29.s > >>>>>>>>>>> - st1d z28.d, p7, [x0, z26.d, lsl 3] > >>>>>>>>>>> - incw x4 > >>>>>>>>>>> - uunpkhi z29.d, z29.s > >>>>>>>>>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3] > >>>>>>>>>>> + movprfx z28, z30 > >>>>>>>>>>> + mul z28.s, p6/m, z28.s, z31.s > >>>>>>>>>>> add z27.d, z27.d, #1 > >>>>>>>>>>> - whilelo p6.d, x4, x2 > >>>>>>>>>>> - st1d z27.d, p7, [x0, z29.d, lsl 3] > >>>>>>>>>>> - incw z30.s > >>>>>>>>>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3] > >>>>>>>>>>> + incd x4 > >>>>>>>>>>> + add z30.s, z30.s, z29.s > >>>>>>>>>>> whilelo p7.d, x4, x3 > >>>>>>>>>>> b.any .L86 > >>>>>>>>>>> .L84: > 
>>>>>>>>>>> ret > >>>>>>>>>>> > >>>>>>>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no > >>>>>>>>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace > >>>>>>>>>>> machine and saw > >>>>>>>>>>> no non-noise impact on performance. We would appreciate help with > >>>>>>>>>>> wider > >>>>>>>>>>> benchmarking on other platforms, if necessary. > >>>>>>>>>>> OK for mainline? > >>>>>>>>>>> > >>>>>>>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> > >>>>>>>>>>> > >>>>>>>>>>> gcc/ > >>>>>>>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove > >>>>>>>>>>> use_new_vector_costs as tuning option. > >>>>>>>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): > >>>>>>>>>>> Remove. > >>>>>>>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of > >>>>>>>>>>> aarch64_use_new_vector_costs_p and guard call to > >>>>>>>>>>> vect_is_store_elt_extraction with count > 1. > >>>>>>>>>>> (aarch64_vector_costs::finish_cost): Remove use of > >>>>>>>>>>> aarch64_use_new_vector_costs_p. > >>>>>>>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove > >>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. > >>>>>>>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. > >>>>>>>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. > >>>>>>>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. > >>>>>>>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. > >>>>>>>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise. > >>>>>>>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise. > >>>>>>>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise. > >>>>>>>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise. > >>>>>>>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise. > >>>>>>>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. > >>>>>>>>>>> > >>>>>>>>>>> gcc/testsuite/ > >>>>>>>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected > >>>>>>>>>>> outcome. > >>>>>>>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. 
> >>>>>>>>>>> --- > >>>>>>>>>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 -- > >>>>>>>>>>> gcc/config/aarch64/aarch64.cc | 22 > >>>>>>>>>>> +++++-------------- > >>>>>>>>>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - > >>>>>>>>>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - > >>>>>>>>>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 - > >>>>>>>>>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 - > >>>>>>>>>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 - > >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - > >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - > >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - > >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - > >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - > >>>>>>>>>>> .../aarch64/tuning_models/neoversev3ae.h | 1 - > >>>>>>>>>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- > >>>>>>>>>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- > >>>>>>>>>>> 15 files changed, 7 insertions(+), 32 deletions(-) > >>>>>>>>>>> > >>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def > >>>>>>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def > >>>>>>>>>>> index 5939602576b..ed345b13ed3 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def > >>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def > >>>>>>>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION > >>>>>>>>>>> ("cheap_shift_extend", CHEAP_SHIFT_EXTEND) > >>>>>>>>>>> > >>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", > >>>>>>>>>>> CSE_SVE_VL_CONSTANTS) > >>>>>>>>>>> > >>>>>>>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", > >>>>>>>>>>> USE_NEW_VECTOR_COSTS) > >>>>>>>>>>> - > >>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", > >>>>>>>>>>> MATCHED_VECTOR_THROUGHPUT) > >>>>>>>>>>> > >>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", > >>>>>>>>>>> AVOID_CROSS_LOOP_FMA) > >>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64.cc > >>>>>>>>>>> b/gcc/config/aarch64/aarch64.cc > >>>>>>>>>>> index 43238aefef2..03806671c97 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/aarch64.cc > >>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64.cc > >>>>>>>>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info > >>>>>>>>>>> *vinfo, bool costing_for_scalar) > >>>>>>>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar); > >>>>>>>>>>> } > >>>>>>>>>>> > >>>>>>>>>>> -/* Return true if the current CPU should use the new costs > >>>>>>>>>>> defined > >>>>>>>>>>> - in GCC 11. This should be removed for GCC 12 and above, with > >>>>>>>>>>> the > >>>>>>>>>>> - costs applying to all CPUs instead. */ > >>>>>>>>>>> -static bool > >>>>>>>>>>> -aarch64_use_new_vector_costs_p () > >>>>>>>>>>> -{ > >>>>>>>>>>> - return (aarch64_tune_params.extra_tuning_flags > >>>>>>>>>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); > >>>>>>>>>>> -} > >>>>>>>>>>> - > >>>>>>>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. > >>>>>>>>>>> */ > >>>>>>>>>>> static const simd_vec_cost * > >>>>>>>>>>> aarch64_simd_vec_costs (tree vectype) > >>>>>>>>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int > >>>>>>>>>>> count, vect_cost_for_stmt kind, > >>>>>>>>>>> > >>>>>>>>>>> /* Do one-time initialization based on the vinfo. 
*/ > >>>>>>>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); > >>>>>>>>>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) > >>>>>>>>>>> + if (!m_analyzed_vinfo) > >>>>>>>>>>> { > >>>>>>>>>>> if (loop_vinfo) > >>>>>>>>>>> analyze_loop_vinfo (loop_vinfo); > >>>>>>>>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost > >>>>>>>>>>> (int count, vect_cost_for_stmt kind, > >>>>>>>>>>> > >>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead > >>>>>>>>>>> of just looking at KIND. */ > >>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) > >>>>>>>>>>> + if (stmt_info) > >>>>>>>>>>> { > >>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one > >>>>>>>>>>> vec_to_scalar for each element. However, we can store the first > >>>>>>>>>>> element using an FP store without a separate extract step. */ > >>>>>>>>>>> - if (vect_is_store_elt_extraction (kind, stmt_info)) > >>>>>>>>>>> + if (vect_is_store_elt_extraction (kind, stmt_info) && > >>>>>>>>>>> count > 1) > >>>>>>>>>>> count -= 1; > >>>>>>>>>>> > >>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind, > >>>>>>>>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int > >>>>>>>>>>> count, vect_cost_for_stmt kind, > >>>>>>>>>>> else > >>>>>>>>>>> m_num_last_promote_demote = 0; > >>>>>>>>>>> > >>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) > >>>>>>>>>>> + if (stmt_info) > >>>>>>>>>>> { > >>>>>>>>>>> /* Account for any extra "embedded" costs that apply additively > >>>>>>>>>>> to the base cost calculated above. */ > >>>>>>>>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const > >>>>>>>>>>> vector_costs *uncast_scalar_costs) > >>>>>>>>>>> > >>>>>>>>>>> auto *scalar_costs > >>>>>>>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); > >>>>>>>>>>> - if (loop_vinfo > >>>>>>>>>>> - && m_vec_flags > >>>>>>>>>>> - && aarch64_use_new_vector_costs_p ()) > >>>>>>>>>>> + if (loop_vinfo && m_vec_flags) > >>>>>>>>>>> { > >>>>>>>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, > >>>>>>>>>>> m_costs[vect_body]); > >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h > >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h > >>>>>>>>>>> index eb9b89984b0..dafea96e924 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h > >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h > >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params > >>>>>>>>>>> cortexx925_tunings = > >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>>>>>>>> &generic_prefetch_tune, > >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>>>>>>>>> index 6a098497759..ac001927959 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>>>>>>>>> @@ -55,7 +55,6 @@ static const struct tune_params > >>>>>>>>>>> fujitsu_monaka_tunings = > >>>>>>>>>>> 0, /* max_case_values. */ > >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ > >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. > >>>>>>>>>>> */ > >>>>>>>>>>> &generic_prefetch_tune, > >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>>>>>>>>> index 9b1cbfc5bd2..7b534831340 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>>>>>>>>> @@ -183,7 +183,6 @@ static const struct tune_params > >>>>>>>>>>> generic_armv8_a_tunings = > >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. > >>>>>>>>>>> */ > >>>>>>>>>>> &generic_prefetch_tune, > >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>>>>>>>>> index 48353a59939..562ef89c67b 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>>>>>>>>> @@ -249,7 +249,6 @@ static const struct tune_params > >>>>>>>>>>> generic_armv9_a_tunings = > >>>>>>>>>>> 0, /* max_case_values. */ > >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. > >>>>>>>>>>> */ > >>>>>>>>>>> &generic_armv9a_prefetch_tune, > >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>>>>>>>>> index c407b89a22f..fe4f7c10f73 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>>>>>>>>> @@ -156,7 +156,6 @@ static const struct tune_params > >>>>>>>>>>> neoverse512tvb_tunings = > >>>>>>>>>>> 0, /* max_case_values. */ > >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. > >>>>>>>>>>> */ > >>>>>>>>>>> &generic_prefetch_tune, > >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h > >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h > >>>>>>>>>>> index 18199ac206c..56be77423cb 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h > >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h > >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params > >>>>>>>>>>> neoversen2_tunings = > >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ > >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>>>>>>>> &generic_prefetch_tune, > >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h > >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h > >>>>>>>>>>> index 4da85cfac0d..254ad5e27f8 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h > >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h > >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params > >>>>>>>>>>> neoversen3_tunings = > >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. > >>>>>>>>>>> */ > >>>>>>>>>>> &generic_prefetch_tune, > >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h > >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h > >>>>>>>>>>> index dd9120eee48..c7241cf23d7 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h > >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h > >>>>>>>>>>> @@ -227,7 +227,6 @@ static const struct tune_params > >>>>>>>>>>> neoversev1_tunings = > >>>>>>>>>>> 0, /* max_case_values. */ > >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h > >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h > >>>>>>>>>>> index 1369de73991..96f55940649 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h > >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h > >>>>>>>>>>> @@ -232,7 +232,6 @@ static const struct tune_params > >>>>>>>>>>> neoversev2_tunings = > >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. > >>>>>>>>>>> */ > >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h > >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h > >>>>>>>>>>> index d8c82255378..f62ae67d355 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h > >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h > >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params > >>>>>>>>>>> neoversev3_tunings = > >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ > >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>>>>>>>> &generic_prefetch_tune, > >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>>>>>>>>> index 7f050501ede..0233baf5e34 100644 > >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params > >>>>>>>>>>> neoversev3ae_tunings = > >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>>>>>>>> &generic_prefetch_tune, > >>>>>>>>>>> diff --git > >>>>>>>>>>> a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>>>>>>>>> index 762805ff54b..c334b7a6875 100644 > >>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>>>>>>>>> @@ -15,4 +15,4 @@ > >>>>>>>>>>> so we vectorize the offset calculation. This means that the > >>>>>>>>>>> 64-bit version needs two copies. */ > >>>>>>>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, > >>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ > >>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, > >>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ > >>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, > >>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ > >>>>>>>>>>> diff --git > >>>>>>>>>>> a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>>>>>>>>> index f0ea58e38e2..94cc63049bc 100644 > >>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>>>>>>>>> @@ -15,4 +15,4 @@ > >>>>>>>>>>> so we vectorize the offset calculation. This means that the > >>>>>>>>>>> 64-bit version needs two copies. */ > >>>>>>>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], > >>>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ > >>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, > >>>>>>>>>>> p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ > >>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, > >>>>>>>>>>> p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> -- > >>>>>>>>>> Richard Biener <rguent...@suse.de> > >>>>>>>>>> SUSE Software Solutions Germany GmbH, > >>>>>>>>>> Frankenstrasse 146, 90461 Nuernberg, Germany; > >>>>>>>>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > >>>>>>>>>> Nuernberg) > >