> On 14 Dec 2024, at 09:32, Richard Biener <rguent...@suse.de> wrote: > > External email: Use caution opening links or attachments > > >> Am 13.12.2024 um 18:00 schrieb Jennifer Schmitz <jschm...@nvidia.com>: >> >> >> >>> On 13 Dec 2024, at 13:40, Richard Biener <richard.guent...@gmail.com> wrote: >>> >>> External email: Use caution opening links or attachments >>> >>> >>>> On Thu, Dec 12, 2024 at 5:27 PM Jennifer Schmitz <jschm...@nvidia.com> >>>> wrote: >>>> >>>> >>>> >>>>> On 6 Dec 2024, at 08:41, Jennifer Schmitz <jschm...@nvidia.com> wrote: >>>>> >>>>> >>>>> >>>>>> On 5 Dec 2024, at 20:07, Richard Sandiford <richard.sandif...@arm.com> >>>>>> wrote: >>>>>> >>>>>> External email: Use caution opening links or attachments >>>>>> >>>>>> >>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes: >>>>>>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote: >>>>>>>> >>>>>>>> External email: Use caution opening links or attachments >>>>>>>> >>>>>>>> >>>>>>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford >>>>>>>>>> <richard.sandif...@arm.com> wrote: >>>>>>>>>> >>>>>>>>>> External email: Use caution opening links or attachments >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes: >>>>>>>>>>> [...] >>>>>>>>>>> Looking at the diff of the vect dumps (below is a section of the >>>>>>>>>>> diff for strided_store_2.c), it seemed odd that vec_to_scalar >>>>>>>>>>> operations cost 0 now, instead of the previous cost of 2: >>>>>>>>>>> >>>>>>>>>>> +strided_store_1.c:38:151: note: === vectorizable_operation === >>>>>>>>>>> +strided_store_1.c:38:151: note: vect_model_simple_cost: >>>>>>>>>>> inside_cost = 1, prologue_cost = 0 . >>>>>>>>>>> +strided_store_1.c:38:151: note: ==> examining statement: *_6 = >>>>>>>>>>> _7; >>>>>>>>>>> +strided_store_1.c:38:151: note: vect_is_simple_use: operand _3 + >>>>>>>>>>> 1.0e+0, type of def: internal >>>>>>>>>>> +strided_store_1.c:38:151: note: Vectorizing an unaligned access. >>>>>>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128 >>>>>>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234 >>>>>>>>>>> +strided_store_1.c:38:151: note: vect_model_store_cost: >>>>>>>>>>> inside_cost = 12, prologue_cost = 0 . 
>>>>>>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body >>>>>>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue >>>>>>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body >>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body >>>>>>>>>>> +<unknown> 1 times vector_load costs 1 in prologue >>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body >>>>>>>>>>> _7 1 times scalar_store costs 1 in body >>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body >>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body >>>>>>>>>>> _7 1 times scalar_store costs 1 in body >>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body >>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body >>>>>>>>>>> _7 1 times scalar_store costs 1 in body >>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body >>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body >>>>>>>>>>> _7 1 times scalar_store costs 1 in body >>>>>>>>>>> >>>>>>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in >>>>>>>>>>> multiple places in aarch64.cc, the location that causes this >>>>>>>>>>> behavior is this one: >>>>>>>>>>> unsigned >>>>>>>>>>> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt >>>>>>>>>>> kind, >>>>>>>>>>> stmt_vec_info stmt_info, slp_tree, >>>>>>>>>>> tree vectype, int misalign, >>>>>>>>>>> vect_cost_model_location where) >>>>>>>>>>> { >>>>>>>>>>> [...] >>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead >>>>>>>>>>> of just looking at KIND. */ >>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>>>>>>>>> + if (stmt_info) >>>>>>>>>>> { >>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one >>>>>>>>>>> vec_to_scalar for each element. However, we can store the first >>>>>>>>>>> element using an FP store without a separate extract step. */ >>>>>>>>>>> if (vect_is_store_elt_extraction (kind, stmt_info)) >>>>>>>>>>> count -= 1; >>>>>>>>>>> >>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind, >>>>>>>>>>> stmt_info, >>>>>>>>>>> stmt_cost); >>>>>>>>>>> >>>>>>>>>>> if (vectype && m_vec_flags) >>>>>>>>>>> stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind, >>>>>>>>>>> stmt_info, vectype, >>>>>>>>>>> where, stmt_cost); >>>>>>>>>>> } >>>>>>>>>>> [...] >>>>>>>>>>> return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil >>>>>>>>>>> ()); >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> Previously, for mtune=generic, this function returned a cost of 2 >>>>>>>>>>> for a vec_to_scalar operation in the vect body. Now "if >>>>>>>>>>> (stmt_info)" is entered and "if (vect_is_store_elt_extraction >>>>>>>>>>> (kind, stmt_info))" evaluates to true, which sets the count to 0 >>>>>>>>>>> and leads to a return value of 0. >>>>>>>>>> >>>>>>>>>> At the time the code was written, a scalarised store would be costed >>>>>>>>>> using one vec_to_scalar call into the backend, with the count >>>>>>>>>> parameter >>>>>>>>>> set to the number of elements being stored. The "count -= 1" was >>>>>>>>>> supposed to lop off the leading element extraction, since we can >>>>>>>>>> store >>>>>>>>>> lane 0 as a normal FP store. >>>>>>>>>> >>>>>>>>>> The target-independent costing was later reworked so that it costs >>>>>>>>>> each operation individually: >>>>>>>>>> >>>>>>>>>> for (i = 0; i < nstores; i++) >>>>>>>>>> { >>>>>>>>>> if (costing_p) >>>>>>>>>> { >>>>>>>>>> /* Only need vector extracting when there are more >>>>>>>>>> than one stores. 
*/ >>>>>>>>>> if (nstores > 1) >>>>>>>>>> inside_cost >>>>>>>>>> += record_stmt_cost (cost_vec, 1, vec_to_scalar, >>>>>>>>>> stmt_info, 0, vect_body); >>>>>>>>>> /* Take a single lane vector type store as scalar >>>>>>>>>> store to avoid ICE like 110776. */ >>>>>>>>>> if (VECTOR_TYPE_P (ltype) >>>>>>>>>> && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >>>>>>>>>> n_adjacent_stores++; >>>>>>>>>> else >>>>>>>>>> inside_cost >>>>>>>>>> += record_stmt_cost (cost_vec, 1, scalar_store, >>>>>>>>>> stmt_info, 0, vect_body); >>>>>>>>>> continue; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> Unfortunately, there's no easy way of telling whether a particular >>>>>>>>>> call >>>>>>>>>> is part of a group, and if so, which member of the group it is. >>>>>>>>>> >>>>>>>>>> I suppose we could give up on the attempt to be (somewhat) accurate >>>>>>>>>> and just disable the optimisation. Or we could restrict it to count >>>>>>>>>> > 1, >>>>>>>>>> since it might still be useful for gathers and scatters. >>>>>>>>> I tried restricting the calls to vect_is_store_elt_extraction to >>>>>>>>> count > 1 and it seems to resolve the issue of costing vec_to_scalar >>>>>>>>> operations with 0 (see patch below). >>>>>>>>> What are your thoughts on this? >>>>>>>> >>>>>>>> Why didn't you pursue instead moving the vec_to_scalar cost together >>>>>>>> with the n_adjacent_store handling? >>>>>>> When I continued working on this patch, we had already reached stage 3 >>>>>>> and I was hesitant to introduce changes to the middle-end that were not >>>>>>> previously covered by this patch. So I tried if the issue could not be >>>>>>> resolved by making a small change in the backend. >>>>>>> If you still advise to use the n_adjacent_store instead, I’m happy to >>>>>>> look into it again. >>>>>> >>>>>> If Richard's ok with adjusting vectorizable_store for GCC 15 (which it >>>>>> sounds like he is), then I agree that would be better. Otherwise we'd >>>>>> be creating technical debt to clean up for GCC 16. And it is a >>>>>> regression >>>>>> of sorts, so is stage 3 material from that POV. >>>>>> >>>>>> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a >>>>>> "let's clean this up next stage 1" thing, since we needed to add tuning >>>>>> for a new CPU late during the cycle. But of course, there were other >>>>>> priorities when stage 1 actually came around, so it never actually >>>>>> happened. Thanks again for being the one to sort this out.) >>>>> Thanks for your feedback. Then I will try to make it work in >>>>> vectorizable_store. >>>>> Best, >>>>> Jennifer >>>> Below is the updated patch with a suggestion for the changes in >>>> vectorizable_store. It resolves the issue with the vec_to_scalar >>>> operations that were individually costed with 0. >>>> We already tested it on aarch64, no regression, but we are still doing >>>> performance testing. >>>> Can you give some feedback in the meantime on the patch itself? >>>> Thanks, >>>> Jennifer >>>> >>>> >>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and >>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the >>>> default. To that end, the function aarch64_use_new_vector_costs_p and its >>>> uses >>>> were removed. 
To prevent costing vec_to_scalar operations with 0, as >>>> described in >>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, >>>> we adjusted vectorizable_store such that the variable n_adjacent_stores >>>> also covers vec_to_scalar operations. This way vec_to_scalar operations >>>> are not costed individually, but as a group. >>>> >>>> Two tests were adjusted due to changes in codegen. In both cases, the >>>> old code performed loop unrolling once, but the new code does not: >>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>> -moverride=tune=none): >>>> f_int64_t_32: >>>> cbz w3, .L92 >>>> mov x4, 0 >>>> uxtw x3, w3 >>>> + cntd x5 >>>> + whilelo p7.d, xzr, x3 >>>> + mov z29.s, w5 >>>> mov z31.s, w2 >>>> - whilelo p6.d, xzr, x3 >>>> - mov x2, x3 >>>> - index z30.s, #0, #1 >>>> - uqdecd x2 >>>> - ptrue p5.b, all >>>> - whilelo p7.d, xzr, x2 >>>> + index z30.d, #0, #1 >>>> + ptrue p6.b, all >>>> .p2align 3,,7 >>>> .L94: >>>> - ld1d z27.d, p7/z, [x0, #1, mul vl] >>>> - ld1d z28.d, p6/z, [x0] >>>> - movprfx z29, z31 >>>> - mul z29.s, p5/m, z29.s, z30.s >>>> - incw x4 >>>> - uunpklo z0.d, z29.s >>>> - uunpkhi z29.d, z29.s >>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] >>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] >>>> - add z25.d, z28.d, z25.d >>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3] >>>> + movprfx z28, z31 >>>> + mul z28.s, p6/m, z28.s, z30.s >>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] >>>> add z26.d, z27.d, z26.d >>>> - st1d z26.d, p7, [x0, #1, mul vl] >>>> - whilelo p7.d, x4, x2 >>>> - st1d z25.d, p6, [x0] >>>> - incw z30.s >>>> - incb x0, all, mul #2 >>>> - whilelo p6.d, x4, x3 >>>> + st1d z26.d, p7, [x0, x4, lsl 3] >>>> + add z30.s, z30.s, z29.s >>>> + incd x4 >>>> + whilelo p7.d, x4, x3 >>>> b.any .L94 >>>> .L92: >>>> ret >>>> >>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>> -moverride=tune=none): >>>> f_int64_t_32: >>>> cbz w3, .L84 >>>> - addvl x5, x1, #1 >>>> mov x4, 0 >>>> uxtw x3, w3 >>>> - mov z31.s, w2 >>>> + cntd x5 >>>> whilelo p7.d, xzr, x3 >>>> - mov x2, x3 >>>> - index z30.s, #0, #1 >>>> - uqdecd x2 >>>> - ptrue p5.b, all >>>> - whilelo p6.d, xzr, x2 >>>> + mov z29.s, w5 >>>> + mov z31.s, w2 >>>> + index z30.d, #0, #1 >>>> + ptrue p6.b, all >>>> .p2align 3,,7 >>>> .L86: >>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3] >>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3] >>>> - movprfx z29, z30 >>>> - mul z29.s, p5/m, z29.s, z31.s >>>> - add z28.d, z28.d, #1 >>>> - uunpklo z26.d, z29.s >>>> - st1d z28.d, p7, [x0, z26.d, lsl 3] >>>> - incw x4 >>>> - uunpkhi z29.d, z29.s >>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3] >>>> + movprfx z28, z30 >>>> + mul z28.s, p6/m, z28.s, z31.s >>>> add z27.d, z27.d, #1 >>>> - whilelo p6.d, x4, x2 >>>> - st1d z27.d, p7, [x0, z29.d, lsl 3] >>>> - incw z30.s >>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3] >>>> + incd x4 >>>> + add z30.s, z30.s, z29.s >>>> whilelo p7.d, x4, x3 >>>> b.any .L86 >>>> .L84: >>>> ret >>>> >>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no >>>> regression. >>>> OK for mainline? >>>> >>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> >>>> >>>> gcc/ >>>> * tree-vect-stmts.cc (vectorizable_store): Extend the use of >>>> n_adjacent_stores to also cover vec_to_scalar operations. >>>> * config/aarch64/aarch64-tuning-flags.def: Remove >>>> use_new_vector_costs as tuning option. 
>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): >>>> Remove. >>>> (aarch64_vector_costs::add_stmt_cost): Remove use of >>>> aarch64_use_new_vector_costs_p. >>>> (aarch64_vector_costs::finish_cost): Remove use of >>>> aarch64_use_new_vector_costs_p. >>>> * config/aarch64/tuning_models/cortexx925.h: Remove >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. >>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. >>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. >>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. >>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. >>>> * config/aarch64/tuning_models/neoversen2.h: Likewise. >>>> * config/aarch64/tuning_models/neoversen3.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev1.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev2.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev3.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. >>>> >>>> gcc/testsuite/ >>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. >>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. >>>> --- >>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 -- >>>> gcc/config/aarch64/aarch64.cc | 20 +++---------- >>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - >>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - >>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 - >>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 - >>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - >>>> .../aarch64/tuning_models/neoversev3ae.h | 1 - >>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- >>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- >>>> gcc/tree-vect-stmts.cc | 29 ++++++++++--------- >>>> 16 files changed, 22 insertions(+), 44 deletions(-) >>>> >>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def >>>> b/gcc/config/aarch64/aarch64-tuning-flags.def >>>> index ffbff20e29c..1de633c739b 100644 >>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def >>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def >>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", >>>> CHEAP_SHIFT_EXTEND) >>>> >>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS) >>>> >>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS) >>>> - >>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", >>>> MATCHED_VECTOR_THROUGHPUT) >>>> >>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA) >>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc >>>> index 77a2a6bfa3a..71fba9cc63b 100644 >>>> --- a/gcc/config/aarch64/aarch64.cc >>>> +++ b/gcc/config/aarch64/aarch64.cc >>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, >>>> bool costing_for_scalar) >>>> return new aarch64_vector_costs (vinfo, costing_for_scalar); >>>> } >>>> >>>> -/* Return true if the current CPU should use the new costs defined >>>> - in GCC 11. This should be removed for GCC 12 and above, with the >>>> - costs applying to all CPUs instead. 
*/ >>>> -static bool >>>> -aarch64_use_new_vector_costs_p () >>>> -{ >>>> - return (aarch64_tune_params.extra_tuning_flags >>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); >>>> -} >>>> - >>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ >>>> static const simd_vec_cost * >>>> aarch64_simd_vec_costs (tree vectype) >>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>> vect_cost_for_stmt kind, >>>> >>>> /* Do one-time initialization based on the vinfo. */ >>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); >>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) >>>> + if (!m_analyzed_vinfo) >>>> { >>>> if (loop_vinfo) >>>> analyze_loop_vinfo (loop_vinfo); >>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>> vect_cost_for_stmt kind, >>>> >>>> /* Try to get a more accurate cost by looking at STMT_INFO instead >>>> of just looking at KIND. */ >>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>> + if (stmt_info) >>>> { >>>> /* If we scalarize a strided store, the vectorizer costs one >>>> vec_to_scalar for each element. However, we can store the first >>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>> vect_cost_for_stmt kind, >>>> else >>>> m_num_last_promote_demote = 0; >>>> >>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>> + if (stmt_info) >>>> { >>>> /* Account for any extra "embedded" costs that apply additively >>>> to the base cost calculated above. */ >>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const >>>> vector_costs *uncast_scalar_costs) >>>> >>>> auto *scalar_costs >>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); >>>> - if (loop_vinfo >>>> - && m_vec_flags >>>> - && aarch64_use_new_vector_costs_p ()) >>>> + if (loop_vinfo && m_vec_flags) >>>> { >>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >>>> m_costs[vect_body]); >>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h >>>> b/gcc/config/aarch64/tuning_models/cortexx925.h >>>> index b2ff716157a..0a8eff69307 100644 >>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h >>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> index 2d704ecd110..a564528f43d 100644 >>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. 
*/ >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> index bdd309ab03d..f090d5cde50 100644 >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> @@ -183,7 +183,6 @@ static const struct tune_params >>>> generic_armv8_a_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> index a05a9ab92a2..4c33c147444 100644 >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> @@ -249,7 +249,6 @@ static const struct tune_params >>>> generic_armv9_a_tunings = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> index c407b89a22f..fe4f7c10f73 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings >>>> = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h >>>> b/gcc/config/aarch64/tuning_models/neoversen2.h >>>> index fd5f8f37370..0c74068da2c 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h >>>> b/gcc/config/aarch64/tuning_models/neoversen3.h >>>> index 8b156c2fe4d..9d4e1be171a 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. 
*/ >>>> &generic_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h >>>> b/gcc/config/aarch64/tuning_models/neoversev1.h >>>> index 23c121d8652..85a78bb2bef 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h >>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h >>>> b/gcc/config/aarch64/tuning_models/neoversev2.h >>>> index 40af5f47f4f..1dd452beb8d 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h >>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW >>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h >>>> b/gcc/config/aarch64/tuning_models/neoversev3.h >>>> index d65d74bfecf..d0ba5b1aef6 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> index 7b7fa0b4b08..a1572048503 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> index 762805ff54b..c334b7a6875 100644 >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> @@ -15,4 +15,4 @@ >>>> so we vectorize the offset calculation. This means that the >>>> 64-bit version needs two copies. 
*/ >>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> index f0ea58e38e2..94cc63049bc 100644 >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> @@ -15,4 +15,4 @@ >>>> so we vectorize the offset calculation. This means that the >>>> 64-bit version needs two copies. */ >>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc >>>> index be1139a423c..6d7d28c4702 100644 >>>> --- a/gcc/tree-vect-stmts.cc >>>> +++ b/gcc/tree-vect-stmts.cc >>>> @@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo, >>>> { >>>> if (costing_p) >>>> { >>>> - /* Only need vector extracting when there are more >>>> - than one stores. */ >>>> - if (nstores > 1) >>>> - inside_cost >>>> - += record_stmt_cost (cost_vec, 1, vec_to_scalar, >>>> - stmt_info, slp_node, >>>> - 0, vect_body); >>>> /* Take a single lane vector type store as scalar >>>> store to avoid ICE like 110776. */ >>>> - if (VECTOR_TYPE_P (ltype) >>>> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >>>> + bool single_lane_vec_p = >>>> + VECTOR_TYPE_P (ltype) >>>> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U); >>>> + /* Only need vector extracting when there are more >>>> + than one stores. */ >>>> + if (nstores > 1 || single_lane_vec_p) >>>> n_adjacent_stores++; >>>> - else >>>> + if (!single_lane_vec_p) >>> >>> I think it's somewhat non-obvious that nstores > 1 and single_lane_vec_p >>> correlate. In fact I think that we always record a store, just for >>> single-element >>> vectors we record scalar stores. I suggest to here always to just >>> n_adjacent_stores++ >>> and below ... >>> >>>> inside_cost >>>> += record_stmt_cost (cost_vec, 1, scalar_store, >>>> stmt_info, 0, vect_body); >>>> @@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo, >>>> if (costing_p) >>>> { >>>> if (n_adjacent_stores > 0) >>>> - vect_get_store_cost (vinfo, stmt_info, slp_node, >>>> n_adjacent_stores, >>>> - alignment_support_scheme, misalignment, >>>> - &inside_cost, cost_vec); >>>> + { >>>> + vect_get_store_cost (vinfo, stmt_info, slp_node, >>>> n_adjacent_stores, >>>> + alignment_support_scheme, misalignment, >>>> + &inside_cost, cost_vec); >>> >>> ... record n_adjacent_stores scalar_store when ltype is single-lane and >>> record >>> n_adjacent_stores vect_to_scalar if nstores > 1 (and else none). >>> >>> Richard. >> Thanks for the feedback, I’m glad it’s going in the right direction. Below >> is the updated patch, re-validated on aarch64. 
>> Thanks, Jennifer >> >> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and >> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the >> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the >> default. To that end, the function aarch64_use_new_vector_costs_p and its >> uses >> were removed. To prevent costing vec_to_scalar operations with 0, as >> described in >> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, >> we adjusted vectorizable_store such that the variable n_adjacent_stores >> also covers vec_to_scalar operations. This way vec_to_scalar operations >> are not costed individually, but as a group. >> >> Two tests were adjusted due to changes in codegen. In both cases, the >> old code performed loop unrolling once, but the new code does not: >> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with >> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >> -moverride=tune=none): >> f_int64_t_32: >> cbz w3, .L92 >> mov x4, 0 >> uxtw x3, w3 >> + cntd x5 >> + whilelo p7.d, xzr, x3 >> + mov z29.s, w5 >> mov z31.s, w2 >> - whilelo p6.d, xzr, x3 >> - mov x2, x3 >> - index z30.s, #0, #1 >> - uqdecd x2 >> - ptrue p5.b, all >> - whilelo p7.d, xzr, x2 >> + index z30.d, #0, #1 >> + ptrue p6.b, all >> .p2align 3,,7 >> .L94: >> - ld1d z27.d, p7/z, [x0, #1, mul vl] >> - ld1d z28.d, p6/z, [x0] >> - movprfx z29, z31 >> - mul z29.s, p5/m, z29.s, z30.s >> - incw x4 >> - uunpklo z0.d, z29.s >> - uunpkhi z29.d, z29.s >> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] >> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] >> - add z25.d, z28.d, z25.d >> + ld1d z27.d, p7/z, [x0, x4, lsl 3] >> + movprfx z28, z31 >> + mul z28.s, p6/m, z28.s, z30.s >> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] >> add z26.d, z27.d, z26.d >> - st1d z26.d, p7, [x0, #1, mul vl] >> - whilelo p7.d, x4, x2 >> - st1d z25.d, p6, [x0] >> - incw z30.s >> - incb x0, all, mul #2 >> - whilelo p6.d, x4, x3 >> + st1d z26.d, p7, [x0, x4, lsl 3] >> + add z30.s, z30.s, z29.s >> + incd x4 >> + whilelo p7.d, x4, x3 >> b.any .L94 >> .L92: >> ret >> >> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with >> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >> -moverride=tune=none): >> f_int64_t_32: >> cbz w3, .L84 >> - addvl x5, x1, #1 >> mov x4, 0 >> uxtw x3, w3 >> - mov z31.s, w2 >> + cntd x5 >> whilelo p7.d, xzr, x3 >> - mov x2, x3 >> - index z30.s, #0, #1 >> - uqdecd x2 >> - ptrue p5.b, all >> - whilelo p6.d, xzr, x2 >> + mov z29.s, w5 >> + mov z31.s, w2 >> + index z30.d, #0, #1 >> + ptrue p6.b, all >> .p2align 3,,7 >> .L86: >> - ld1d z28.d, p7/z, [x1, x4, lsl 3] >> - ld1d z27.d, p6/z, [x5, x4, lsl 3] >> - movprfx z29, z30 >> - mul z29.s, p5/m, z29.s, z31.s >> - add z28.d, z28.d, #1 >> - uunpklo z26.d, z29.s >> - st1d z28.d, p7, [x0, z26.d, lsl 3] >> - incw x4 >> - uunpkhi z29.d, z29.s >> + ld1d z27.d, p7/z, [x1, x4, lsl 3] >> + movprfx z28, z30 >> + mul z28.s, p6/m, z28.s, z31.s >> add z27.d, z27.d, #1 >> - whilelo p6.d, x4, x2 >> - st1d z27.d, p7, [x0, z29.d, lsl 3] >> - incw z30.s >> + st1d z27.d, p7, [x0, z28.d, uxtw 3] >> + incd x4 >> + add z30.s, z30.s, z29.s >> whilelo p7.d, x4, x3 >> b.any .L86 >> .L84: >> ret >> >> The patch was bootstrapped and tested on aarch64-linux-gnu, no >> regression. >> OK for mainline? >> >> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> >> >> gcc/ >> * tree-vect-stmts.cc (vectorizable_store): Extend the use of >> n_adjacent_stores to also cover vec_to_scalar operations. 
>> * config/aarch64/aarch64-tuning-flags.def: Remove >> use_new_vector_costs as tuning option. >> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): >> Remove. >> (aarch64_vector_costs::add_stmt_cost): Remove use of >> aarch64_use_new_vector_costs_p. >> (aarch64_vector_costs::finish_cost): Remove use of >> aarch64_use_new_vector_costs_p. >> * config/aarch64/tuning_models/cortexx925.h: Remove >> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. >> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. >> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. >> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. >> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. >> * config/aarch64/tuning_models/neoversen2.h: Likewise. >> * config/aarch64/tuning_models/neoversen3.h: Likewise. >> * config/aarch64/tuning_models/neoversev1.h: Likewise. >> * config/aarch64/tuning_models/neoversev2.h: Likewise. >> * config/aarch64/tuning_models/neoversev3.h: Likewise. >> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. >> >> gcc/testsuite/ >> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. >> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. >> --- >> gcc/config/aarch64/aarch64-tuning-flags.def | 2 - >> gcc/config/aarch64/aarch64.cc | 20 ++-------- >> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - >> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - >> .../aarch64/tuning_models/generic_armv8_a.h | 1 - >> .../aarch64/tuning_models/generic_armv9_a.h | 1 - >> .../aarch64/tuning_models/neoverse512tvb.h | 1 - >> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - >> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - >> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - >> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - >> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - >> .../aarch64/tuning_models/neoversev3ae.h | 1 - >> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- >> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- >> gcc/tree-vect-stmts.cc | 37 +++++++++++-------- >> 16 files changed, 27 insertions(+), 47 deletions(-) >> >> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def >> b/gcc/config/aarch64/aarch64-tuning-flags.def >> index ffbff20e29c..1de633c739b 100644 >> --- a/gcc/config/aarch64/aarch64-tuning-flags.def >> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def >> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", >> CHEAP_SHIFT_EXTEND) >> >> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS) >> >> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS) >> - >> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", >> MATCHED_VECTOR_THROUGHPUT) >> >> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA) >> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc >> index 77a2a6bfa3a..71fba9cc63b 100644 >> --- a/gcc/config/aarch64/aarch64.cc >> +++ b/gcc/config/aarch64/aarch64.cc >> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, >> bool costing_for_scalar) >> return new aarch64_vector_costs (vinfo, costing_for_scalar); >> } >> >> -/* Return true if the current CPU should use the new costs defined >> - in GCC 11. This should be removed for GCC 12 and above, with the >> - costs applying to all CPUs instead. 
*/ >> -static bool >> -aarch64_use_new_vector_costs_p () >> -{ >> - return (aarch64_tune_params.extra_tuning_flags >> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); >> -} >> - >> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ >> static const simd_vec_cost * >> aarch64_simd_vec_costs (tree vectype) >> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >> vect_cost_for_stmt kind, >> >> /* Do one-time initialization based on the vinfo. */ >> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); >> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) >> + if (!m_analyzed_vinfo) >> { >> if (loop_vinfo) >> analyze_loop_vinfo (loop_vinfo); >> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >> vect_cost_for_stmt kind, >> >> /* Try to get a more accurate cost by looking at STMT_INFO instead >> of just looking at KIND. */ >> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >> + if (stmt_info) >> { >> /* If we scalarize a strided store, the vectorizer costs one >> vec_to_scalar for each element. However, we can store the first >> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >> vect_cost_for_stmt kind, >> else >> m_num_last_promote_demote = 0; >> >> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >> + if (stmt_info) >> { >> /* Account for any extra "embedded" costs that apply additively >> to the base cost calculated above. */ >> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const >> vector_costs *uncast_scalar_costs) >> >> auto *scalar_costs >> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); >> - if (loop_vinfo >> - && m_vec_flags >> - && aarch64_use_new_vector_costs_p ()) >> + if (loop_vinfo && m_vec_flags) >> { >> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >> m_costs[vect_body]); >> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h >> b/gcc/config/aarch64/tuning_models/cortexx925.h >> index 5ebaf66e986..74772f3e15f 100644 >> --- a/gcc/config/aarch64/tuning_models/cortexx925.h >> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h >> @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings = >> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >> (AARCH64_EXTRA_TUNE_BASE >> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >> &generic_armv9a_prefetch_tune, >> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >> index 2d704ecd110..a564528f43d 100644 >> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings = >> 0, /* max_case_values. */ >> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >> (AARCH64_EXTRA_TUNE_BASE >> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >> &generic_prefetch_tune, >> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. 
*/ >> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >> index bdd309ab03d..f090d5cde50 100644 >> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >> @@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings = >> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >> (AARCH64_EXTRA_TUNE_BASE >> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >> &generic_prefetch_tune, >> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >> index 785e00946bc..7b5821183bc 100644 >> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >> @@ -251,7 +251,6 @@ static const struct tune_params generic_armv9_a_tunings = >> 0, /* max_case_values. */ >> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >> (AARCH64_EXTRA_TUNE_BASE >> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >> &generic_armv9a_prefetch_tune, >> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >> index 007f987154c..f7457df59e5 100644 >> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings = >> 0, /* max_case_values. */ >> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >> &generic_armv9a_prefetch_tune, >> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h >> b/gcc/config/aarch64/tuning_models/neoversen2.h >> index 32560d2f5f8..541b61c8179 100644 >> --- a/gcc/config/aarch64/tuning_models/neoversen2.h >> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h >> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings = >> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >> (AARCH64_EXTRA_TUNE_BASE >> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >> &generic_armv9a_prefetch_tune, >> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h >> b/gcc/config/aarch64/tuning_models/neoversen3.h >> index 2010bc4645b..eff668132a8 100644 >> --- a/gcc/config/aarch64/tuning_models/neoversen3.h >> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h >> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings = >> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >> (AARCH64_EXTRA_TUNE_BASE >> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >> &generic_armv9a_prefetch_tune, >> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. 
*/ >> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h >> b/gcc/config/aarch64/tuning_models/neoversev1.h >> index c3751e32696..d11472b6e1e 100644 >> --- a/gcc/config/aarch64/tuning_models/neoversev1.h >> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h >> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings = >> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >> (AARCH64_EXTRA_TUNE_BASE >> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >> &generic_armv9a_prefetch_tune, >> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h >> b/gcc/config/aarch64/tuning_models/neoversev2.h >> index 80dbe5c806c..ee77ffdd3bc 100644 >> --- a/gcc/config/aarch64/tuning_models/neoversev2.h >> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h >> @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings = >> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >> (AARCH64_EXTRA_TUNE_BASE >> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW >> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ >> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h >> b/gcc/config/aarch64/tuning_models/neoversev3.h >> index efe09e16d1e..6ef143ef7d5 100644 >> --- a/gcc/config/aarch64/tuning_models/neoversev3.h >> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h >> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings = >> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >> (AARCH64_EXTRA_TUNE_BASE >> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >> &generic_armv9a_prefetch_tune, >> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h >> b/gcc/config/aarch64/tuning_models/neoversev3ae.h >> index 66849f30889..96bdbf971f1 100644 >> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h >> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h >> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings = >> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >> (AARCH64_EXTRA_TUNE_BASE >> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >> &generic_armv9a_prefetch_tune, >> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >> index 762805ff54b..c334b7a6875 100644 >> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >> @@ -15,4 +15,4 @@ >> so we vectorize the offset calculation. This means that the >> 64-bit version needs two copies. 
*/ >> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, >> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >> index f0ea58e38e2..94cc63049bc 100644 >> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >> @@ -15,4 +15,4 @@ >> so we vectorize the offset calculation. This means that the >> 64-bit version needs two copies. */ >> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+, >> z[0-9]+.s, uxtw 2\]\n} 3 } } */ >> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc >> index be1139a423c..ab57163c243 100644 >> --- a/gcc/tree-vect-stmts.cc >> +++ b/gcc/tree-vect-stmts.cc >> @@ -8834,19 +8834,8 @@ vectorizable_store (vec_info *vinfo, >> { >> if (costing_p) >> { >> - /* Only need vector extracting when there are more >> - than one stores. */ >> - if (nstores > 1) >> - inside_cost >> - += record_stmt_cost (cost_vec, 1, vec_to_scalar, >> - stmt_info, slp_node, >> - 0, vect_body); >> - /* Take a single lane vector type store as scalar >> - store to avoid ICE like 110776. */ >> - if (VECTOR_TYPE_P (ltype) >> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >> - n_adjacent_stores++; >> - else >> + n_adjacent_stores++; >> + if (!VECTOR_TYPE_P (ltype)) > > This should be combined with the single-lane vector case below. > >> inside_cost >> += record_stmt_cost (cost_vec, 1, scalar_store, >> stmt_info, 0, vect_body); >> @@ -8905,9 +8894,25 @@ vectorizable_store (vec_info *vinfo, >> if (costing_p) >> { >> if (n_adjacent_stores > 0) >> - vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores, >> - alignment_support_scheme, misalignment, >> - &inside_cost, cost_vec); >> + { >> + /* Take a single lane vector type store as scalar >> + store to avoid ICE like 110776. */ >> + if (VECTOR_TYPE_P (ltype) >> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >> + inside_cost >> + += record_stmt_cost (cost_vec, n_adjacent_stores, >> + scalar_store, stmt_info, 0, vect_body); >> + /* Only need vector extracting when there are more >> + than one stores. */ >> + if (nstores > 1) >> + inside_cost >> + += record_stmt_cost (cost_vec, n_adjacent_stores, >> + vec_to_scalar, stmt_info, slp_node, >> + 0, vect_body); >> + vect_get_store_cost (vinfo, stmt_info, slp_node, > > This should only be done for multi-lane vectors. Thanks for the quick reply. As I am making the changes, I am wondering: Do we even need n_adjacent_stores anymore? It appears to always have the same value as nstores. Can we remove it and use nstores instead or does it still serve another purpose?
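For illustration, with nstores substituted directly and your two comments folded in (scalar stores for a non-vector or single-lane ltype, vect_get_store_cost only for multi-lane vectors), I imagine the second costing hunk would shrink to roughly the sketch below. It is completely untested and only meant to make the question concrete; it assumes the two counters really are interchangeable here:

  if (costing_p)
    {
      if (nstores > 0)  /* Mirrors the old n_adjacent_stores > 0 check.  */
        {
          /* Cost multi-lane vector pieces as vector stores and
             single-lane pieces as scalar stores (cf. PR110776).  */
          if (VECTOR_TYPE_P (ltype)
              && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
            vect_get_store_cost (vinfo, stmt_info, slp_node, nstores,
                                 alignment_support_scheme, misalignment,
                                 &inside_cost, cost_vec);
          else
            inside_cost
              += record_stmt_cost (cost_vec, nstores, scalar_store,
                                   stmt_info, 0, vect_body);
          /* Only need vector extracting when there is more
             than one store.  */
          if (nstores > 1)
            inside_cost
              += record_stmt_cost (cost_vec, nstores, vec_to_scalar,
                                   stmt_info, slp_node, 0, vect_body);
        }
      ...
    }

If there are paths where the two counters can diverge, then keeping n_adjacent_stores is of course the safer choice.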
Thanks, Jennifer > >> + n_adjacent_stores, alignment_support_scheme, >> + misalignment, &inside_cost, cost_vec); >> + } >> if (dump_enabled_p ()) >> dump_printf_loc (MSG_NOTE, vect_location, >> "vect_model_store_cost: inside_cost = %d, " >> -- >> 2.34.1 >>> >>>> + inside_cost >>>> + += record_stmt_cost (cost_vec, n_adjacent_stores, >>>> vec_to_scalar, >>>> + stmt_info, slp_node, >>>> + 0, vect_body); >>>> + } >>>> if (dump_enabled_p ()) >>>> dump_printf_loc (MSG_NOTE, vect_location, >>>> "vect_model_store_cost: inside_cost = %d, " >>>> -- >>>> 2.44.0 >>>> >>>> >>>>>> >>>>>> Richard >>>>>> >>>>>>> Thanks, >>>>>>> Jennifer >>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Jennifer >>>>>>>>> >>>>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> tunable and >>>>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the >>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the >>>>>>>>> default. To that end, the function aarch64_use_new_vector_costs_p and >>>>>>>>> its uses >>>>>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as >>>>>>>>> described in >>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, >>>>>>>>> we guarded the call to vect_is_store_elt_extraction in >>>>>>>>> aarch64_vector_costs::add_stmt_cost by count > 1. >>>>>>>>> >>>>>>>>> Two tests were adjusted due to changes in codegen. In both cases, the >>>>>>>>> old code performed loop unrolling once, but the new code does not: >>>>>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with >>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>>>>>>> -moverride=tune=none): >>>>>>>>> f_int64_t_32: >>>>>>>>> cbz w3, .L92 >>>>>>>>> mov x4, 0 >>>>>>>>> uxtw x3, w3 >>>>>>>>> + cntd x5 >>>>>>>>> + whilelo p7.d, xzr, x3 >>>>>>>>> + mov z29.s, w5 >>>>>>>>> mov z31.s, w2 >>>>>>>>> - whilelo p6.d, xzr, x3 >>>>>>>>> - mov x2, x3 >>>>>>>>> - index z30.s, #0, #1 >>>>>>>>> - uqdecd x2 >>>>>>>>> - ptrue p5.b, all >>>>>>>>> - whilelo p7.d, xzr, x2 >>>>>>>>> + index z30.d, #0, #1 >>>>>>>>> + ptrue p6.b, all >>>>>>>>> .p2align 3,,7 >>>>>>>>> .L94: >>>>>>>>> - ld1d z27.d, p7/z, [x0, #1, mul vl] >>>>>>>>> - ld1d z28.d, p6/z, [x0] >>>>>>>>> - movprfx z29, z31 >>>>>>>>> - mul z29.s, p5/m, z29.s, z30.s >>>>>>>>> - incw x4 >>>>>>>>> - uunpklo z0.d, z29.s >>>>>>>>> - uunpkhi z29.d, z29.s >>>>>>>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] >>>>>>>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] >>>>>>>>> - add z25.d, z28.d, z25.d >>>>>>>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3] >>>>>>>>> + movprfx z28, z31 >>>>>>>>> + mul z28.s, p6/m, z28.s, z30.s >>>>>>>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] >>>>>>>>> add z26.d, z27.d, z26.d >>>>>>>>> - st1d z26.d, p7, [x0, #1, mul vl] >>>>>>>>> - whilelo p7.d, x4, x2 >>>>>>>>> - st1d z25.d, p6, [x0] >>>>>>>>> - incw z30.s >>>>>>>>> - incb x0, all, mul #2 >>>>>>>>> - whilelo p6.d, x4, x3 >>>>>>>>> + st1d z26.d, p7, [x0, x4, lsl 3] >>>>>>>>> + add z30.s, z30.s, z29.s >>>>>>>>> + incd x4 >>>>>>>>> + whilelo p7.d, x4, x3 >>>>>>>>> b.any .L94 >>>>>>>>> .L92: >>>>>>>>> ret >>>>>>>>> >>>>>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with >>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>>>>>>> -moverride=tune=none): >>>>>>>>> f_int64_t_32: >>>>>>>>> cbz w3, .L84 >>>>>>>>> - addvl x5, x1, #1 >>>>>>>>> mov x4, 0 >>>>>>>>> uxtw x3, w3 >>>>>>>>> - mov z31.s, w2 >>>>>>>>> + cntd x5 >>>>>>>>> whilelo p7.d, xzr, x3 >>>>>>>>> - mov x2, x3 >>>>>>>>> - index z30.s, #0, #1 >>>>>>>>> 
- uqdecd x2 >>>>>>>>> - ptrue p5.b, all >>>>>>>>> - whilelo p6.d, xzr, x2 >>>>>>>>> + mov z29.s, w5 >>>>>>>>> + mov z31.s, w2 >>>>>>>>> + index z30.d, #0, #1 >>>>>>>>> + ptrue p6.b, all >>>>>>>>> .p2align 3,,7 >>>>>>>>> .L86: >>>>>>>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3] >>>>>>>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3] >>>>>>>>> - movprfx z29, z30 >>>>>>>>> - mul z29.s, p5/m, z29.s, z31.s >>>>>>>>> - add z28.d, z28.d, #1 >>>>>>>>> - uunpklo z26.d, z29.s >>>>>>>>> - st1d z28.d, p7, [x0, z26.d, lsl 3] >>>>>>>>> - incw x4 >>>>>>>>> - uunpkhi z29.d, z29.s >>>>>>>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3] >>>>>>>>> + movprfx z28, z30 >>>>>>>>> + mul z28.s, p6/m, z28.s, z31.s >>>>>>>>> add z27.d, z27.d, #1 >>>>>>>>> - whilelo p6.d, x4, x2 >>>>>>>>> - st1d z27.d, p7, [x0, z29.d, lsl 3] >>>>>>>>> - incw z30.s >>>>>>>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3] >>>>>>>>> + incd x4 >>>>>>>>> + add z30.s, z30.s, z29.s >>>>>>>>> whilelo p7.d, x4, x3 >>>>>>>>> b.any .L86 >>>>>>>>> .L84: >>>>>>>>> ret >>>>>>>>> >>>>>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no >>>>>>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace >>>>>>>>> machine and saw >>>>>>>>> no non-noise impact on performance. We would appreciate help with >>>>>>>>> wider >>>>>>>>> benchmarking on other platforms, if necessary. >>>>>>>>> OK for mainline? >>>>>>>>> >>>>>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> >>>>>>>>> >>>>>>>>> gcc/ >>>>>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove >>>>>>>>> use_new_vector_costs as tuning option. >>>>>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): >>>>>>>>> Remove. >>>>>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of >>>>>>>>> aarch64_use_new_vector_costs_p and guard call to >>>>>>>>> vect_is_store_elt_extraction with count > 1. >>>>>>>>> (aarch64_vector_costs::finish_cost): Remove use of >>>>>>>>> aarch64_use_new_vector_costs_p. >>>>>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove >>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. >>>>>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. >>>>>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. >>>>>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. >>>>>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. >>>>>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise. >>>>>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise. >>>>>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise. >>>>>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise. >>>>>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise. >>>>>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. >>>>>>>>> >>>>>>>>> gcc/testsuite/ >>>>>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. >>>>>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. 
>>>>>>>>> --- >>>>>>>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 -- >>>>>>>>> gcc/config/aarch64/aarch64.cc | 22 +++++-------------- >>>>>>>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - >>>>>>>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - >>>>>>>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 - >>>>>>>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 - >>>>>>>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 - >>>>>>>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - >>>>>>>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - >>>>>>>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - >>>>>>>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - >>>>>>>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - >>>>>>>>> .../aarch64/tuning_models/neoversev3ae.h | 1 - >>>>>>>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- >>>>>>>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- >>>>>>>>> 15 files changed, 7 insertions(+), 32 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>>>>> index 5939602576b..ed345b13ed3 100644 >>>>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", >>>>>>>>> CHEAP_SHIFT_EXTEND) >>>>>>>>> >>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", >>>>>>>>> CSE_SVE_VL_CONSTANTS) >>>>>>>>> >>>>>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", >>>>>>>>> USE_NEW_VECTOR_COSTS) >>>>>>>>> - >>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", >>>>>>>>> MATCHED_VECTOR_THROUGHPUT) >>>>>>>>> >>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", >>>>>>>>> AVOID_CROSS_LOOP_FMA) >>>>>>>>> diff --git a/gcc/config/aarch64/aarch64.cc >>>>>>>>> b/gcc/config/aarch64/aarch64.cc >>>>>>>>> index 43238aefef2..03806671c97 100644 >>>>>>>>> --- a/gcc/config/aarch64/aarch64.cc >>>>>>>>> +++ b/gcc/config/aarch64/aarch64.cc >>>>>>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info >>>>>>>>> *vinfo, bool costing_for_scalar) >>>>>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar); >>>>>>>>> } >>>>>>>>> >>>>>>>>> -/* Return true if the current CPU should use the new costs defined >>>>>>>>> - in GCC 11. This should be removed for GCC 12 and above, with the >>>>>>>>> - costs applying to all CPUs instead. */ >>>>>>>>> -static bool >>>>>>>>> -aarch64_use_new_vector_costs_p () >>>>>>>>> -{ >>>>>>>>> - return (aarch64_tune_params.extra_tuning_flags >>>>>>>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); >>>>>>>>> -} >>>>>>>>> - >>>>>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ >>>>>>>>> static const simd_vec_cost * >>>>>>>>> aarch64_simd_vec_costs (tree vectype) >>>>>>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int >>>>>>>>> count, vect_cost_for_stmt kind, >>>>>>>>> >>>>>>>>> /* Do one-time initialization based on the vinfo. 
*/ >>>>>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); >>>>>>>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) >>>>>>>>> + if (!m_analyzed_vinfo) >>>>>>>>> { >>>>>>>>> if (loop_vinfo) >>>>>>>>> analyze_loop_vinfo (loop_vinfo); >>>>>>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost (int >>>>>>>>> count, vect_cost_for_stmt kind, >>>>>>>>> >>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead >>>>>>>>> of just looking at KIND. */ >>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>>>>>>> + if (stmt_info) >>>>>>>>> { >>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one >>>>>>>>> vec_to_scalar for each element. However, we can store the first >>>>>>>>> element using an FP store without a separate extract step. */ >>>>>>>>> - if (vect_is_store_elt_extraction (kind, stmt_info)) >>>>>>>>> + if (vect_is_store_elt_extraction (kind, stmt_info) && count > >>>>>>>>> 1) >>>>>>>>> count -= 1; >>>>>>>>> >>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind, >>>>>>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int >>>>>>>>> count, vect_cost_for_stmt kind, >>>>>>>>> else >>>>>>>>> m_num_last_promote_demote = 0; >>>>>>>>> >>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>>>>>>> + if (stmt_info) >>>>>>>>> { >>>>>>>>> /* Account for any extra "embedded" costs that apply additively >>>>>>>>> to the base cost calculated above. */ >>>>>>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const >>>>>>>>> vector_costs *uncast_scalar_costs) >>>>>>>>> >>>>>>>>> auto *scalar_costs >>>>>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); >>>>>>>>> - if (loop_vinfo >>>>>>>>> - && m_vec_flags >>>>>>>>> - && aarch64_use_new_vector_costs_p ()) >>>>>>>>> + if (loop_vinfo && m_vec_flags) >>>>>>>>> { >>>>>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >>>>>>>>> m_costs[vect_body]); >>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>>>>> index eb9b89984b0..dafea96e924 100644 >>>>>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params >>>>>>>>> cortexx925_tunings = >>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>>>>> &generic_prefetch_tune, >>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>>>>> index 6a098497759..ac001927959 100644 >>>>>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>>>>> @@ -55,7 +55,6 @@ static const struct tune_params >>>>>>>>> fujitsu_monaka_tunings = >>>>>>>>> 0, /* max_case_values. */ >>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. 
*/ >>>>>>>>> &generic_prefetch_tune, >>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>>>>> index 9b1cbfc5bd2..7b534831340 100644 >>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>>>>> @@ -183,7 +183,6 @@ static const struct tune_params >>>>>>>>> generic_armv8_a_tunings = >>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>>>>> &generic_prefetch_tune, >>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>>>>> index 48353a59939..562ef89c67b 100644 >>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>>>>> @@ -249,7 +249,6 @@ static const struct tune_params >>>>>>>>> generic_armv9_a_tunings = >>>>>>>>> 0, /* max_case_values. */ >>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>>>>> &generic_armv9a_prefetch_tune, >>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>>>>> index c407b89a22f..fe4f7c10f73 100644 >>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>>>>> @@ -156,7 +156,6 @@ static const struct tune_params >>>>>>>>> neoverse512tvb_tunings = >>>>>>>>> 0, /* max_case_values. */ >>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>>>>> &generic_prefetch_tune, >>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>>>>> index 18199ac206c..56be77423cb 100644 >>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params >>>>>>>>> neoversen2_tunings = >>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. 
*/ >>>>>>>>> &generic_prefetch_tune, >>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>>>>> index 4da85cfac0d..254ad5e27f8 100644 >>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params >>>>>>>>> neoversen3_tunings = >>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>>>>> &generic_prefetch_tune, >>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>>>>> index dd9120eee48..c7241cf23d7 100644 >>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>>>>> @@ -227,7 +227,6 @@ static const struct tune_params >>>>>>>>> neoversev1_tunings = >>>>>>>>> 0, /* max_case_values. */ >>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>>>>> | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>>>>> index 1369de73991..96f55940649 100644 >>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>>>>> @@ -232,7 +232,6 @@ static const struct tune_params >>>>>>>>> neoversev2_tunings = >>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW >>>>>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ >>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>>>>> index d8c82255378..f62ae67d355 100644 >>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params >>>>>>>>> neoversev3_tunings = >>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. 
*/ >>>>>>>>> &generic_prefetch_tune, >>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>>>>> index 7f050501ede..0233baf5e34 100644 >>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params >>>>>>>>> neoversev3ae_tunings = >>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>>>>> &generic_prefetch_tune, >>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>>>>> index 762805ff54b..c334b7a6875 100644 >>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>>>>> @@ -15,4 +15,4 @@ >>>>>>>>> so we vectorize the offset calculation. This means that the >>>>>>>>> 64-bit version needs two copies. */ >>>>>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, >>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>>>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>>>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>>>>> index f0ea58e38e2..94cc63049bc 100644 >>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>>>>> @@ -15,4 +15,4 @@ >>>>>>>>> so we vectorize the offset calculation. This means that the >>>>>>>>> 64-bit version needs two copies. */ >>>>>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], >>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>>>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>>>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Richard Biener <rguent...@suse.de> >>>>>>>> SUSE Software Solutions Germany GmbH, >>>>>>>> Frankenstrasse 146, 90461 Nuernberg, Germany; >>>>>>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG >>>>>>>> Nuernberg)
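
For readers following the thread, the behavioural change in the add_stmt_cost hunk quoted above comes down to applying the store-element-extraction discount only when more than one element is being costed, so a lone vec_to_scalar no longer ends up costed as 0. Below is a minimal standalone sketch of that logic, not the actual GCC sources: stmt_desc, is_store_elt_extraction and sketch_add_stmt_cost are illustrative stand-ins for stmt_vec_info, vect_is_store_elt_extraction and aarch64_vector_costs::add_stmt_cost in aarch64.cc.

#include <cassert>

/* Illustrative stand-in for stmt_vec_info; the flag plays the role of
   vect_is_store_elt_extraction (kind, stmt_info).  */
struct stmt_desc
{
  bool is_store_elt_extraction;
};

/* Simplified model of the costing step changed by the patch.  */
static unsigned
sketch_add_stmt_cost (int count, const stmt_desc *stmt_info,
		      unsigned stmt_cost)
{
  if (stmt_info)
    {
      /* If we scalarize a strided store, the vectorizer costs one
	 vec_to_scalar for each element.  The first element can be stored
	 with an FP store without a separate extract step, so drop one
	 element from the count -- but only if there is more than one,
	 otherwise the statement would be costed as free.  */
      if (stmt_info->is_store_elt_extraction && count > 1)
	count -= 1;
    }
  return count * stmt_cost;
}

int
main ()
{
  stmt_desc extract = { true };
  /* Without the count > 1 guard this would return 0; with the guard a
     single extraction keeps its cost.  */
  assert (sketch_add_stmt_cost (1, &extract, 2) == 2);
  /* With several elements the leading extraction is still free.  */
  assert (sketch_add_stmt_cost (4, &extract, 2) == 6);
  return 0;
}

The two asserts mirror the dump diff discussed earlier in the thread: per-element costing of a scalarised store no longer collapses to zero for a single vec_to_scalar, while a multi-element store still gets the one-element discount.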