On Thu, Dec 12, 2024 at 5:27 PM Jennifer Schmitz <jschm...@nvidia.com> wrote: > > > > > On 6 Dec 2024, at 08:41, Jennifer Schmitz <jschm...@nvidia.com> wrote: > > > > > > > >> On 5 Dec 2024, at 20:07, Richard Sandiford <richard.sandif...@arm.com> > >> wrote: > >> > >> External email: Use caution opening links or attachments > >> > >> > >> Jennifer Schmitz <jschm...@nvidia.com> writes: > >>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote: > >>>> > >>>> External email: Use caution opening links or attachments > >>>> > >>>> > >>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote: > >>>> > >>>>> > >>>>> > >>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford > >>>>>> <richard.sandif...@arm.com> wrote: > >>>>>> > >>>>>> External email: Use caution opening links or attachments > >>>>>> > >>>>>> > >>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes: > >>>>>>> [...] > >>>>>>> Looking at the diff of the vect dumps (below is a section of the diff > >>>>>>> for strided_store_2.c), it seemed odd that vec_to_scalar operations > >>>>>>> cost 0 now, instead of the previous cost of 2: > >>>>>>> > >>>>>>> +strided_store_1.c:38:151: note: === vectorizable_operation === > >>>>>>> +strided_store_1.c:38:151: note: vect_model_simple_cost: > >>>>>>> inside_cost = 1, prologue_cost = 0 . > >>>>>>> +strided_store_1.c:38:151: note: ==> examining statement: *_6 = _7; > >>>>>>> +strided_store_1.c:38:151: note: vect_is_simple_use: operand _3 + > >>>>>>> 1.0e+0, type of def: internal > >>>>>>> +strided_store_1.c:38:151: note: Vectorizing an unaligned access. > >>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128 > >>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234 > >>>>>>> +strided_store_1.c:38:151: note: vect_model_store_cost: inside_cost > >>>>>>> = 12, prologue_cost = 0 . > >>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body > >>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue > >>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body > >>>>>>> -_7 1 times vec_to_scalar costs 2 in body > >>>>>>> +<unknown> 1 times vector_load costs 1 in prologue > >>>>>>> +_7 1 times vec_to_scalar costs 0 in body > >>>>>>> _7 1 times scalar_store costs 1 in body > >>>>>>> -_7 1 times vec_to_scalar costs 2 in body > >>>>>>> +_7 1 times vec_to_scalar costs 0 in body > >>>>>>> _7 1 times scalar_store costs 1 in body > >>>>>>> -_7 1 times vec_to_scalar costs 2 in body > >>>>>>> +_7 1 times vec_to_scalar costs 0 in body > >>>>>>> _7 1 times scalar_store costs 1 in body > >>>>>>> -_7 1 times vec_to_scalar costs 2 in body > >>>>>>> +_7 1 times vec_to_scalar costs 0 in body > >>>>>>> _7 1 times scalar_store costs 1 in body > >>>>>>> > >>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in multiple > >>>>>>> places in aarch64.cc, the location that causes this behavior is this > >>>>>>> one: > >>>>>>> unsigned > >>>>>>> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt > >>>>>>> kind, > >>>>>>> stmt_vec_info stmt_info, slp_tree, > >>>>>>> tree vectype, int misalign, > >>>>>>> vect_cost_model_location where) > >>>>>>> { > >>>>>>> [...] > >>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead > >>>>>>> of just looking at KIND. */ > >>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) > >>>>>>> + if (stmt_info) > >>>>>>> { > >>>>>>> /* If we scalarize a strided store, the vectorizer costs one > >>>>>>> vec_to_scalar for each element. 
However, we can store the first > >>>>>>> element using an FP store without a separate extract step. */ > >>>>>>> if (vect_is_store_elt_extraction (kind, stmt_info)) > >>>>>>> count -= 1; > >>>>>>> > >>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind, > >>>>>>> stmt_info, > >>>>>>> stmt_cost); > >>>>>>> > >>>>>>> if (vectype && m_vec_flags) > >>>>>>> stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind, > >>>>>>> stmt_info, > >>>>>>> vectype, > >>>>>>> where, stmt_cost); > >>>>>>> } > >>>>>>> [...] > >>>>>>> return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil > >>>>>>> ()); > >>>>>>> } > >>>>>>> > >>>>>>> Previously, for mtune=generic, this function returned a cost of 2 for > >>>>>>> a vec_to_scalar operation in the vect body. Now "if (stmt_info)" is > >>>>>>> entered and "if (vect_is_store_elt_extraction (kind, stmt_info))" > >>>>>>> evaluates to true, which sets the count to 0 and leads to a return > >>>>>>> value of 0. > >>>>>> > >>>>>> At the time the code was written, a scalarised store would be costed > >>>>>> using one vec_to_scalar call into the backend, with the count parameter > >>>>>> set to the number of elements being stored. The "count -= 1" was > >>>>>> supposed to lop off the leading element extraction, since we can store > >>>>>> lane 0 as a normal FP store. > >>>>>> > >>>>>> The target-independent costing was later reworked so that it costs > >>>>>> each operation individually: > >>>>>> > >>>>>> for (i = 0; i < nstores; i++) > >>>>>> { > >>>>>> if (costing_p) > >>>>>> { > >>>>>> /* Only need vector extracting when there are more > >>>>>> than one stores. */ > >>>>>> if (nstores > 1) > >>>>>> inside_cost > >>>>>> += record_stmt_cost (cost_vec, 1, vec_to_scalar, > >>>>>> stmt_info, 0, vect_body); > >>>>>> /* Take a single lane vector type store as scalar > >>>>>> store to avoid ICE like 110776. */ > >>>>>> if (VECTOR_TYPE_P (ltype) > >>>>>> && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) > >>>>>> n_adjacent_stores++; > >>>>>> else > >>>>>> inside_cost > >>>>>> += record_stmt_cost (cost_vec, 1, scalar_store, > >>>>>> stmt_info, 0, vect_body); > >>>>>> continue; > >>>>>> } > >>>>>> > >>>>>> Unfortunately, there's no easy way of telling whether a particular call > >>>>>> is part of a group, and if so, which member of the group it is. > >>>>>> > >>>>>> I suppose we could give up on the attempt to be (somewhat) accurate > >>>>>> and just disable the optimisation. Or we could restrict it to count > > >>>>>> 1, > >>>>>> since it might still be useful for gathers and scatters. > >>>>> I tried restricting the calls to vect_is_store_elt_extraction to count > >>>>> > 1 and it seems to resolve the issue of costing vec_to_scalar > >>>>> operations with 0 (see patch below). > >>>>> What are your thoughts on this? > >>>> > >>>> Why didn't you pursue instead moving the vec_to_scalar cost together > >>>> with the n_adjacent_store handling? > >>> When I continued working on this patch, we had already reached stage 3 > >>> and I was hesitant to introduce changes to the middle-end that were not > >>> previously covered by this patch. So I tried if the issue could not be > >>> resolved by making a small change in the backend. > >>> If you still advise to use the n_adjacent_store instead, I’m happy to > >>> look into it again. > >> > >> If Richard's ok with adjusting vectorizable_store for GCC 15 (which it > >> sounds like he is), then I agree that would be better. Otherwise we'd > >> be creating technical debt to clean up for GCC 16. 
And it is a regression > >> of sorts, so is stage 3 material from that POV. > >> > >> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a > >> "let's clean this up next stage 1" thing, since we needed to add tuning > >> for a new CPU late during the cycle. But of course, there were other > >> priorities when stage 1 actually came around, so it never actually > >> happened. Thanks again for being the one to sort this out.) > > Thanks for your feedback. Then I will try to make it work in > > vectorizable_store. > > Best, > > Jennifer > Below is the updated patch with a suggestion for the changes in > vectorizable_store. It resolves the issue with the vec_to_scalar operations > that were individually costed with 0. > We already tested it on aarch64, no regression, but we are still doing > performance testing. > Can you give some feedback in the meantime on the patch itself? > Thanks, > Jennifer > > > This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and > use_new_vector_costs entry in aarch64-tuning-flags.def and makes the > AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the > default. To that end, the function aarch64_use_new_vector_costs_p and its uses > were removed. To prevent costing vec_to_scalar operations with 0, as > described in > https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, > we adjusted vectorizable_store such that the variable n_adjacent_stores > also covers vec_to_scalar operations. This way vec_to_scalar operations > are not costed individually, but as a group. > > Two tests were adjusted due to changes in codegen. In both cases, the > old code performed loop unrolling once, but the new code does not: > Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with > -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic > -moverride=tune=none): > f_int64_t_32: > cbz w3, .L92 > mov x4, 0 > uxtw x3, w3 > + cntd x5 > + whilelo p7.d, xzr, x3 > + mov z29.s, w5 > mov z31.s, w2 > - whilelo p6.d, xzr, x3 > - mov x2, x3 > - index z30.s, #0, #1 > - uqdecd x2 > - ptrue p5.b, all > - whilelo p7.d, xzr, x2 > + index z30.d, #0, #1 > + ptrue p6.b, all > .p2align 3,,7 > .L94: > - ld1d z27.d, p7/z, [x0, #1, mul vl] > - ld1d z28.d, p6/z, [x0] > - movprfx z29, z31 > - mul z29.s, p5/m, z29.s, z30.s > - incw x4 > - uunpklo z0.d, z29.s > - uunpkhi z29.d, z29.s > - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] > - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] > - add z25.d, z28.d, z25.d > + ld1d z27.d, p7/z, [x0, x4, lsl 3] > + movprfx z28, z31 > + mul z28.s, p6/m, z28.s, z30.s > + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] > add z26.d, z27.d, z26.d > - st1d z26.d, p7, [x0, #1, mul vl] > - whilelo p7.d, x4, x2 > - st1d z25.d, p6, [x0] > - incw z30.s > - incb x0, all, mul #2 > - whilelo p6.d, x4, x3 > + st1d z26.d, p7, [x0, x4, lsl 3] > + add z30.s, z30.s, z29.s > + incd x4 > + whilelo p7.d, x4, x3 > b.any .L94 > .L92: > ret > > Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with > -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic > -moverride=tune=none): > f_int64_t_32: > cbz w3, .L84 > - addvl x5, x1, #1 > mov x4, 0 > uxtw x3, w3 > - mov z31.s, w2 > + cntd x5 > whilelo p7.d, xzr, x3 > - mov x2, x3 > - index z30.s, #0, #1 > - uqdecd x2 > - ptrue p5.b, all > - whilelo p6.d, xzr, x2 > + mov z29.s, w5 > + mov z31.s, w2 > + index z30.d, #0, #1 > + ptrue p6.b, all > .p2align 3,,7 > .L86: > - ld1d z28.d, p7/z, [x1, x4, lsl 3] > - ld1d z27.d, p6/z, [x5, x4, lsl 3] > - movprfx z29, z30 > - mul z29.s, p5/m, z29.s, z31.s > - add 
z28.d, z28.d, #1 > - uunpklo z26.d, z29.s > - st1d z28.d, p7, [x0, z26.d, lsl 3] > - incw x4 > - uunpkhi z29.d, z29.s > + ld1d z27.d, p7/z, [x1, x4, lsl 3] > + movprfx z28, z30 > + mul z28.s, p6/m, z28.s, z31.s > add z27.d, z27.d, #1 > - whilelo p6.d, x4, x2 > - st1d z27.d, p7, [x0, z29.d, lsl 3] > - incw z30.s > + st1d z27.d, p7, [x0, z28.d, uxtw 3] > + incd x4 > + add z30.s, z30.s, z29.s > whilelo p7.d, x4, x3 > b.any .L86 > .L84: > ret > > The patch was bootstrapped and tested on aarch64-linux-gnu, no > regression. > OK for mainline? > > Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> > > gcc/ > * tree-vect-stmts.cc (vectorizable_store): Extend the use of > n_adjacent_stores to also cover vec_to_scalar operations. > * config/aarch64/aarch64-tuning-flags.def: Remove > use_new_vector_costs as tuning option. > * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): > Remove. > (aarch64_vector_costs::add_stmt_cost): Remove use of > aarch64_use_new_vector_costs_p. > (aarch64_vector_costs::finish_cost): Remove use of > aarch64_use_new_vector_costs_p. > * config/aarch64/tuning_models/cortexx925.h: Remove > AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. > * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. > * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. > * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. > * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. > * config/aarch64/tuning_models/neoversen2.h: Likewise. > * config/aarch64/tuning_models/neoversen3.h: Likewise. > * config/aarch64/tuning_models/neoversev1.h: Likewise. > * config/aarch64/tuning_models/neoversev2.h: Likewise. > * config/aarch64/tuning_models/neoversev3.h: Likewise. > * config/aarch64/tuning_models/neoversev3ae.h: Likewise. > > gcc/testsuite/ > * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. > * gcc.target/aarch64/sve/strided_store_2.c: Likewise. 
> --- > gcc/config/aarch64/aarch64-tuning-flags.def | 2 -- > gcc/config/aarch64/aarch64.cc | 20 +++---------- > gcc/config/aarch64/tuning_models/cortexx925.h | 1 - > .../aarch64/tuning_models/fujitsu_monaka.h | 1 - > .../aarch64/tuning_models/generic_armv8_a.h | 1 - > .../aarch64/tuning_models/generic_armv9_a.h | 1 - > .../aarch64/tuning_models/neoverse512tvb.h | 1 - > gcc/config/aarch64/tuning_models/neoversen2.h | 1 - > gcc/config/aarch64/tuning_models/neoversen3.h | 1 - > gcc/config/aarch64/tuning_models/neoversev1.h | 1 - > gcc/config/aarch64/tuning_models/neoversev2.h | 1 - > gcc/config/aarch64/tuning_models/neoversev3.h | 1 - > .../aarch64/tuning_models/neoversev3ae.h | 1 - > .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- > .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- > gcc/tree-vect-stmts.cc | 29 ++++++++++--------- > 16 files changed, 22 insertions(+), 44 deletions(-) > > diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def > b/gcc/config/aarch64/aarch64-tuning-flags.def > index ffbff20e29c..1de633c739b 100644 > --- a/gcc/config/aarch64/aarch64-tuning-flags.def > +++ b/gcc/config/aarch64/aarch64-tuning-flags.def > @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", > CHEAP_SHIFT_EXTEND) > > AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS) > > -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS) > - > AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", > MATCHED_VECTOR_THROUGHPUT) > > AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA) > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index 77a2a6bfa3a..71fba9cc63b 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, > bool costing_for_scalar) > return new aarch64_vector_costs (vinfo, costing_for_scalar); > } > > -/* Return true if the current CPU should use the new costs defined > - in GCC 11. This should be removed for GCC 12 and above, with the > - costs applying to all CPUs instead. */ > -static bool > -aarch64_use_new_vector_costs_p () > -{ > - return (aarch64_tune_params.extra_tuning_flags > - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); > -} > - > /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ > static const simd_vec_cost * > aarch64_simd_vec_costs (tree vectype) > @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, > vect_cost_for_stmt kind, > > /* Do one-time initialization based on the vinfo. */ > loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); > - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) > + if (!m_analyzed_vinfo) > { > if (loop_vinfo) > analyze_loop_vinfo (loop_vinfo); > @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, > vect_cost_for_stmt kind, > > /* Try to get a more accurate cost by looking at STMT_INFO instead > of just looking at KIND. */ > - if (stmt_info && aarch64_use_new_vector_costs_p ()) > + if (stmt_info) > { > /* If we scalarize a strided store, the vectorizer costs one > vec_to_scalar for each element. However, we can store the first > @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, > vect_cost_for_stmt kind, > else > m_num_last_promote_demote = 0; > > - if (stmt_info && aarch64_use_new_vector_costs_p ()) > + if (stmt_info) > { > /* Account for any extra "embedded" costs that apply additively > to the base cost calculated above. 
*/ > @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const vector_costs > *uncast_scalar_costs) > > auto *scalar_costs > = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); > - if (loop_vinfo > - && m_vec_flags > - && aarch64_use_new_vector_costs_p ()) > + if (loop_vinfo && m_vec_flags) > { > m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, > m_costs[vect_body]); > diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h > b/gcc/config/aarch64/tuning_models/cortexx925.h > index b2ff716157a..0a8eff69307 100644 > --- a/gcc/config/aarch64/tuning_models/cortexx925.h > +++ b/gcc/config/aarch64/tuning_models/cortexx925.h > @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > &generic_prefetch_tune, > diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > index 2d704ecd110..a564528f43d 100644 > --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings = > 0, /* max_case_values. */ > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > &generic_prefetch_tune, > AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > index bdd309ab03d..f090d5cde50 100644 > --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > @@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > &generic_prefetch_tune, > AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > index a05a9ab92a2..4c33c147444 100644 > --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > @@ -249,7 +249,6 @@ static const struct tune_params generic_armv9_a_tunings = > 0, /* max_case_values. */ > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > &generic_armv9a_prefetch_tune, > AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > index c407b89a22f..fe4f7c10f73 100644 > --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings = > 0, /* max_case_values. */ > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ > (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > &generic_prefetch_tune, > AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h > b/gcc/config/aarch64/tuning_models/neoversen2.h > index fd5f8f37370..0c74068da2c 100644 > --- a/gcc/config/aarch64/tuning_models/neoversen2.h > +++ b/gcc/config/aarch64/tuning_models/neoversen2.h > @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > &generic_prefetch_tune, > diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h > b/gcc/config/aarch64/tuning_models/neoversen3.h > index 8b156c2fe4d..9d4e1be171a 100644 > --- a/gcc/config/aarch64/tuning_models/neoversen3.h > +++ b/gcc/config/aarch64/tuning_models/neoversen3.h > @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > &generic_prefetch_tune, > AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h > b/gcc/config/aarch64/tuning_models/neoversev1.h > index 23c121d8652..85a78bb2bef 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev1.h > +++ b/gcc/config/aarch64/tuning_models/neoversev1.h > @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > &generic_prefetch_tune, > diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h > b/gcc/config/aarch64/tuning_models/neoversev2.h > index 40af5f47f4f..1dd452beb8d 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev2.h > +++ b/gcc/config/aarch64/tuning_models/neoversev2.h > @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW > | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ > diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h > b/gcc/config/aarch64/tuning_models/neoversev3.h > index d65d74bfecf..d0ba5b1aef6 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev3.h > +++ b/gcc/config/aarch64/tuning_models/neoversev3.h > @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. 
*/ > &generic_prefetch_tune, > diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h > b/gcc/config/aarch64/tuning_models/neoversev3ae.h > index 7b7fa0b4b08..a1572048503 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h > +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h > @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings = > tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > (AARCH64_EXTRA_TUNE_BASE > | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > &generic_prefetch_tune, > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > index 762805ff54b..c334b7a6875 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > @@ -15,4 +15,4 @@ > so we vectorize the offset calculation. This means that the > 64-bit version needs two copies. */ > /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, > \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ > -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, > \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ > +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, > \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > index f0ea58e38e2..94cc63049bc 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > @@ -15,4 +15,4 @@ > so we vectorize the offset calculation. This means that the > 64-bit version needs two copies. */ > /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+, > z[0-9]+.s, uxtw 2\]\n} 3 } } */ > -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, > z[0-9]+.d, lsl 3\]\n} 15 } } */ > +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, > z[0-9]+.d, lsl 3\]\n} 9 } } */ > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > index be1139a423c..6d7d28c4702 100644 > --- a/gcc/tree-vect-stmts.cc > +++ b/gcc/tree-vect-stmts.cc > @@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo, > { > if (costing_p) > { > - /* Only need vector extracting when there are more > - than one stores. */ > - if (nstores > 1) > - inside_cost > - += record_stmt_cost (cost_vec, 1, vec_to_scalar, > - stmt_info, slp_node, > - 0, vect_body); > /* Take a single lane vector type store as scalar > store to avoid ICE like 110776. */ > - if (VECTOR_TYPE_P (ltype) > - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) > + bool single_lane_vec_p = > + VECTOR_TYPE_P (ltype) > + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U); > + /* Only need vector extracting when there are more > + than one stores. */ > + if (nstores > 1 || single_lane_vec_p) > n_adjacent_stores++; > - else > + if (!single_lane_vec_p)
I think it's somewhat non-obvious that nstores > 1 and single_lane_vec_p correlate. In fact I think that we always record a store, just for single-element vectors we record scalar stores. I suggest to here always to just n_adjacent_stores++ and below ... > inside_cost > += record_stmt_cost (cost_vec, 1, scalar_store, > stmt_info, 0, vect_body); > @@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo, > if (costing_p) > { > if (n_adjacent_stores > 0) > - vect_get_store_cost (vinfo, stmt_info, slp_node, > n_adjacent_stores, > - alignment_support_scheme, misalignment, > - &inside_cost, cost_vec); > + { > + vect_get_store_cost (vinfo, stmt_info, slp_node, > n_adjacent_stores, > + alignment_support_scheme, misalignment, > + &inside_cost, cost_vec); ... record n_adjacent_stores scalar_store when ltype is single-lane and record n_adjacent_stores vect_to_scalar if nstores > 1 (and else none). Richard. > + inside_cost > + += record_stmt_cost (cost_vec, n_adjacent_stores, > vec_to_scalar, > + stmt_info, slp_node, > + 0, vect_body); > + } > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > "vect_model_store_cost: inside_cost = %d, " > -- > 2.44.0 > > > >> > >> Richard > >> > >>> Thanks, > >>> Jennifer > >>>> > >>>>> Thanks, > >>>>> Jennifer > >>>>> > >>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable > >>>>> and > >>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the > >>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the > >>>>> default. To that end, the function aarch64_use_new_vector_costs_p and > >>>>> its uses > >>>>> were removed. To prevent costing vec_to_scalar operations with 0, as > >>>>> described in > >>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, > >>>>> we guarded the call to vect_is_store_elt_extraction in > >>>>> aarch64_vector_costs::add_stmt_cost by count > 1. > >>>>> > >>>>> Two tests were adjusted due to changes in codegen. 
In both cases, the > >>>>> old code performed loop unrolling once, but the new code does not: > >>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with > >>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic > >>>>> -moverride=tune=none): > >>>>> f_int64_t_32: > >>>>> cbz w3, .L92 > >>>>> mov x4, 0 > >>>>> uxtw x3, w3 > >>>>> + cntd x5 > >>>>> + whilelo p7.d, xzr, x3 > >>>>> + mov z29.s, w5 > >>>>> mov z31.s, w2 > >>>>> - whilelo p6.d, xzr, x3 > >>>>> - mov x2, x3 > >>>>> - index z30.s, #0, #1 > >>>>> - uqdecd x2 > >>>>> - ptrue p5.b, all > >>>>> - whilelo p7.d, xzr, x2 > >>>>> + index z30.d, #0, #1 > >>>>> + ptrue p6.b, all > >>>>> .p2align 3,,7 > >>>>> .L94: > >>>>> - ld1d z27.d, p7/z, [x0, #1, mul vl] > >>>>> - ld1d z28.d, p6/z, [x0] > >>>>> - movprfx z29, z31 > >>>>> - mul z29.s, p5/m, z29.s, z30.s > >>>>> - incw x4 > >>>>> - uunpklo z0.d, z29.s > >>>>> - uunpkhi z29.d, z29.s > >>>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] > >>>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] > >>>>> - add z25.d, z28.d, z25.d > >>>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3] > >>>>> + movprfx z28, z31 > >>>>> + mul z28.s, p6/m, z28.s, z30.s > >>>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] > >>>>> add z26.d, z27.d, z26.d > >>>>> - st1d z26.d, p7, [x0, #1, mul vl] > >>>>> - whilelo p7.d, x4, x2 > >>>>> - st1d z25.d, p6, [x0] > >>>>> - incw z30.s > >>>>> - incb x0, all, mul #2 > >>>>> - whilelo p6.d, x4, x3 > >>>>> + st1d z26.d, p7, [x0, x4, lsl 3] > >>>>> + add z30.s, z30.s, z29.s > >>>>> + incd x4 > >>>>> + whilelo p7.d, x4, x3 > >>>>> b.any .L94 > >>>>> .L92: > >>>>> ret > >>>>> > >>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with > >>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic > >>>>> -moverride=tune=none): > >>>>> f_int64_t_32: > >>>>> cbz w3, .L84 > >>>>> - addvl x5, x1, #1 > >>>>> mov x4, 0 > >>>>> uxtw x3, w3 > >>>>> - mov z31.s, w2 > >>>>> + cntd x5 > >>>>> whilelo p7.d, xzr, x3 > >>>>> - mov x2, x3 > >>>>> - index z30.s, #0, #1 > >>>>> - uqdecd x2 > >>>>> - ptrue p5.b, all > >>>>> - whilelo p6.d, xzr, x2 > >>>>> + mov z29.s, w5 > >>>>> + mov z31.s, w2 > >>>>> + index z30.d, #0, #1 > >>>>> + ptrue p6.b, all > >>>>> .p2align 3,,7 > >>>>> .L86: > >>>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3] > >>>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3] > >>>>> - movprfx z29, z30 > >>>>> - mul z29.s, p5/m, z29.s, z31.s > >>>>> - add z28.d, z28.d, #1 > >>>>> - uunpklo z26.d, z29.s > >>>>> - st1d z28.d, p7, [x0, z26.d, lsl 3] > >>>>> - incw x4 > >>>>> - uunpkhi z29.d, z29.s > >>>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3] > >>>>> + movprfx z28, z30 > >>>>> + mul z28.s, p6/m, z28.s, z31.s > >>>>> add z27.d, z27.d, #1 > >>>>> - whilelo p6.d, x4, x2 > >>>>> - st1d z27.d, p7, [x0, z29.d, lsl 3] > >>>>> - incw z30.s > >>>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3] > >>>>> + incd x4 > >>>>> + add z30.s, z30.s, z29.s > >>>>> whilelo p7.d, x4, x3 > >>>>> b.any .L86 > >>>>> .L84: > >>>>> ret > >>>>> > >>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no > >>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace machine > >>>>> and saw > >>>>> no non-noise impact on performance. We would appreciate help with wider > >>>>> benchmarking on other platforms, if necessary. > >>>>> OK for mainline? > >>>>> > >>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> > >>>>> > >>>>> gcc/ > >>>>> * config/aarch64/aarch64-tuning-flags.def: Remove > >>>>> use_new_vector_costs as tuning option. 
> >>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): > >>>>> Remove. > >>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of > >>>>> aarch64_use_new_vector_costs_p and guard call to > >>>>> vect_is_store_elt_extraction with count > 1. > >>>>> (aarch64_vector_costs::finish_cost): Remove use of > >>>>> aarch64_use_new_vector_costs_p. > >>>>> * config/aarch64/tuning_models/cortexx925.h: Remove > >>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. > >>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. > >>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. > >>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. > >>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. > >>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise. > >>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise. > >>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise. > >>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise. > >>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise. > >>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. > >>>>> > >>>>> gcc/testsuite/ > >>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. > >>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. > >>>>> --- > >>>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 -- > >>>>> gcc/config/aarch64/aarch64.cc | 22 +++++-------------- > >>>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - > >>>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - > >>>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 - > >>>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 - > >>>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 - > >>>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - > >>>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - > >>>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - > >>>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - > >>>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - > >>>>> .../aarch64/tuning_models/neoversev3ae.h | 1 - > >>>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- > >>>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- > >>>>> 15 files changed, 7 insertions(+), 32 deletions(-) > >>>>> > >>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def > >>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def > >>>>> index 5939602576b..ed345b13ed3 100644 > >>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def > >>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def > >>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", > >>>>> CHEAP_SHIFT_EXTEND) > >>>>> > >>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", > >>>>> CSE_SVE_VL_CONSTANTS) > >>>>> > >>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", > >>>>> USE_NEW_VECTOR_COSTS) > >>>>> - > >>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", > >>>>> MATCHED_VECTOR_THROUGHPUT) > >>>>> > >>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", > >>>>> AVOID_CROSS_LOOP_FMA) > >>>>> diff --git a/gcc/config/aarch64/aarch64.cc > >>>>> b/gcc/config/aarch64/aarch64.cc > >>>>> index 43238aefef2..03806671c97 100644 > >>>>> --- a/gcc/config/aarch64/aarch64.cc > >>>>> +++ b/gcc/config/aarch64/aarch64.cc > >>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info > >>>>> *vinfo, bool costing_for_scalar) > >>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar); > >>>>> } > >>>>> > >>>>> -/* Return true if the current CPU should use the new costs defined > >>>>> 
- in GCC 11. This should be removed for GCC 12 and above, with the > >>>>> - costs applying to all CPUs instead. */ > >>>>> -static bool > >>>>> -aarch64_use_new_vector_costs_p () > >>>>> -{ > >>>>> - return (aarch64_tune_params.extra_tuning_flags > >>>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); > >>>>> -} > >>>>> - > >>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ > >>>>> static const simd_vec_cost * > >>>>> aarch64_simd_vec_costs (tree vectype) > >>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int count, > >>>>> vect_cost_for_stmt kind, > >>>>> > >>>>> /* Do one-time initialization based on the vinfo. */ > >>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); > >>>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) > >>>>> + if (!m_analyzed_vinfo) > >>>>> { > >>>>> if (loop_vinfo) > >>>>> analyze_loop_vinfo (loop_vinfo); > >>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost (int > >>>>> count, vect_cost_for_stmt kind, > >>>>> > >>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead > >>>>> of just looking at KIND. */ > >>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) > >>>>> + if (stmt_info) > >>>>> { > >>>>> /* If we scalarize a strided store, the vectorizer costs one > >>>>> vec_to_scalar for each element. However, we can store the first > >>>>> element using an FP store without a separate extract step. */ > >>>>> - if (vect_is_store_elt_extraction (kind, stmt_info)) > >>>>> + if (vect_is_store_elt_extraction (kind, stmt_info) && count > 1) > >>>>> count -= 1; > >>>>> > >>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind, > >>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int count, > >>>>> vect_cost_for_stmt kind, > >>>>> else > >>>>> m_num_last_promote_demote = 0; > >>>>> > >>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) > >>>>> + if (stmt_info) > >>>>> { > >>>>> /* Account for any extra "embedded" costs that apply additively > >>>>> to the base cost calculated above. */ > >>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const > >>>>> vector_costs *uncast_scalar_costs) > >>>>> > >>>>> auto *scalar_costs > >>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); > >>>>> - if (loop_vinfo > >>>>> - && m_vec_flags > >>>>> - && aarch64_use_new_vector_costs_p ()) > >>>>> + if (loop_vinfo && m_vec_flags) > >>>>> { > >>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, > >>>>> m_costs[vect_body]); > >>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h > >>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h > >>>>> index eb9b89984b0..dafea96e924 100644 > >>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h > >>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h > >>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings = > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. 
*/ > >>>>> &generic_prefetch_tune, > >>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>>> index 6a098497759..ac001927959 100644 > >>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h > >>>>> @@ -55,7 +55,6 @@ static const struct tune_params > >>>>> fujitsu_monaka_tunings = > >>>>> 0, /* max_case_values. */ > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>>> &generic_prefetch_tune, > >>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>>> index 9b1cbfc5bd2..7b534831340 100644 > >>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > >>>>> @@ -183,7 +183,6 @@ static const struct tune_params > >>>>> generic_armv8_a_tunings = > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>>> &generic_prefetch_tune, > >>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>>> index 48353a59939..562ef89c67b 100644 > >>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > >>>>> @@ -249,7 +249,6 @@ static const struct tune_params > >>>>> generic_armv9_a_tunings = > >>>>> 0, /* max_case_values. */ > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>>> &generic_armv9a_prefetch_tune, > >>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>>> index c407b89a22f..fe4f7c10f73 100644 > >>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > >>>>> @@ -156,7 +156,6 @@ static const struct tune_params > >>>>> neoverse512tvb_tunings = > >>>>> 0, /* max_case_values. */ > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>>> &generic_prefetch_tune, > >>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. 
*/ > >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h > >>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h > >>>>> index 18199ac206c..56be77423cb 100644 > >>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h > >>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h > >>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings = > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>> &generic_prefetch_tune, > >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h > >>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h > >>>>> index 4da85cfac0d..254ad5e27f8 100644 > >>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h > >>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h > >>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings = > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ > >>>>> &generic_prefetch_tune, > >>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ > >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h > >>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h > >>>>> index dd9120eee48..c7241cf23d7 100644 > >>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h > >>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h > >>>>> @@ -227,7 +227,6 @@ static const struct tune_params neoversev1_tunings = > >>>>> 0, /* max_case_values. */ > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>> | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h > >>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h > >>>>> index 1369de73991..96f55940649 100644 > >>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h > >>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h > >>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings = > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW > >>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ > >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h > >>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h > >>>>> index d8c82255378..f62ae67d355 100644 > >>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h > >>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h > >>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings = > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ > >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>> &generic_prefetch_tune, > >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>>> index 7f050501ede..0233baf5e34 100644 > >>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h > >>>>> @@ -219,7 +219,6 @@ static const struct tune_params > >>>>> neoversev3ae_tunings = > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND > >>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS > >>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT > >>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ > >>>>> &generic_prefetch_tune, > >>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>>> index 762805ff54b..c334b7a6875 100644 > >>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c > >>>>> @@ -15,4 +15,4 @@ > >>>>> so we vectorize the offset calculation. This means that the > >>>>> 64-bit version needs two copies. */ > >>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, > >>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ > >>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, > >>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ > >>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, > >>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ > >>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>>> index f0ea58e38e2..94cc63049bc 100644 > >>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c > >>>>> @@ -15,4 +15,4 @@ > >>>>> so we vectorize the offset calculation. This means that the > >>>>> 64-bit version needs two copies. */ > >>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], > >>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ > >>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], > >>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ > >>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], > >>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ > >>>>> > >>>> > >>>> -- > >>>> Richard Biener <rguent...@suse.de> > >>>> SUSE Software Solutions Germany GmbH, > >>>> Frankenstrasse 146, 90461 Nuernberg, Germany; > >>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > >>>> Nuernberg) > > > > >
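
One possible reading of Richard Biener's suggestion above ("always just n_adjacent_stores++ ... record n_adjacent_stores scalar_store when ltype is single-lane and record n_adjacent_stores vec_to_scalar if nstores > 1"), written out as an untested sketch only.  It rearranges the record_stmt_cost / vect_get_store_cost calls exactly as they appear in the quoted vectorizable_store hunks and uses the ltype, nstores, n_adjacent_stores and costing_p variables from there; it is not the final patch, and the exact placement/overloads may differ in what actually gets committed:

  /* Inside the scalarised-store loop: when only costing, count every
     element of the group and defer all cost recording to after the loop.  */
  if (costing_p)
    {
      n_adjacent_stores++;
      continue;
    }
  [...]
  /* After the loop: cost the whole group at once.  */
  if (costing_p)
    {
      if (n_adjacent_stores > 0)
        {
          /* Single-lane vector types are stored as scalar stores to
             avoid ICEs like PR110776; otherwise cost the group as
             vector stores.  (ltype does not change across the loop,
             so the whole group takes the same branch.)  */
          if (VECTOR_TYPE_P (ltype)
              && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
            vect_get_store_cost (vinfo, stmt_info, slp_node,
                                 n_adjacent_stores,
                                 alignment_support_scheme, misalignment,
                                 &inside_cost, cost_vec);
          else
            inside_cost
              += record_stmt_cost (cost_vec, n_adjacent_stores, scalar_store,
                                   stmt_info, 0, vect_body);
          /* Lane extraction is only needed when one vector feeds more
             than one store.  */
          if (nstores > 1)
            inside_cost
              += record_stmt_cost (cost_vec, n_adjacent_stores, vec_to_scalar,
                                   stmt_info, slp_node, 0, vect_body);
        }
      [...]
    }

Costing the vec_to_scalar operations as one group of n_adjacent_stores rather than one call per element is what lets the aarch64 backend keep its "first element needs no extract" adjustment (count -= 1) without any individual extraction being costed as 0.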