> On 17 Dec 2024, at 18:57, Richard Biener <rguent...@suse.de> wrote: > > External email: Use caution opening links or attachments > > >> Am 16.12.2024 um 09:10 schrieb Jennifer Schmitz <jschm...@nvidia.com>: >> >> >> >>> On 14 Dec 2024, at 09:32, Richard Biener <rguent...@suse.de> wrote: >>> >>> External email: Use caution opening links or attachments >>> >>> >>>>> Am 13.12.2024 um 18:00 schrieb Jennifer Schmitz <jschm...@nvidia.com>: >>>> >>>> >>>> >>>>> On 13 Dec 2024, at 13:40, Richard Biener <richard.guent...@gmail.com> >>>>> wrote: >>>>> >>>>> External email: Use caution opening links or attachments >>>>> >>>>> >>>>>> On Thu, Dec 12, 2024 at 5:27 PM Jennifer Schmitz <jschm...@nvidia.com> >>>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>>> On 6 Dec 2024, at 08:41, Jennifer Schmitz <jschm...@nvidia.com> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> On 5 Dec 2024, at 20:07, Richard Sandiford <richard.sandif...@arm.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>> External email: Use caution opening links or attachments >>>>>>>> >>>>>>>> >>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes: >>>>>>>>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote: >>>>>>>>>> >>>>>>>>>> External email: Use caution opening links or attachments >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford >>>>>>>>>>>> <richard.sandif...@arm.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> External email: Use caution opening links or attachments >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes: >>>>>>>>>>>>> [...] >>>>>>>>>>>>> Looking at the diff of the vect dumps (below is a section of the >>>>>>>>>>>>> diff for strided_store_2.c), it seemed odd that vec_to_scalar >>>>>>>>>>>>> operations cost 0 now, instead of the previous cost of 2: >>>>>>>>>>>>> >>>>>>>>>>>>> +strided_store_1.c:38:151: note: === vectorizable_operation === >>>>>>>>>>>>> +strided_store_1.c:38:151: note: vect_model_simple_cost: >>>>>>>>>>>>> inside_cost = 1, prologue_cost = 0 . >>>>>>>>>>>>> +strided_store_1.c:38:151: note: ==> examining statement: *_6 = >>>>>>>>>>>>> _7; >>>>>>>>>>>>> +strided_store_1.c:38:151: note: vect_is_simple_use: operand _3 >>>>>>>>>>>>> + 1.0e+0, type of def: internal >>>>>>>>>>>>> +strided_store_1.c:38:151: note: Vectorizing an unaligned >>>>>>>>>>>>> access. >>>>>>>>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128 >>>>>>>>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234 >>>>>>>>>>>>> +strided_store_1.c:38:151: note: vect_model_store_cost: >>>>>>>>>>>>> inside_cost = 12, prologue_cost = 0 . 
>>>>>>>>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body >>>>>>>>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue >>>>>>>>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body >>>>>>>>>>>>> +<unknown> 1 times vector_load costs 1 in prologue >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body >>>>>>>>>>>>> >>>>>>>>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in >>>>>>>>>>>>> multiple places in aarch64.cc, the location that causes this >>>>>>>>>>>>> behavior is this one: >>>>>>>>>>>>> unsigned >>>>>>>>>>>>> aarch64_vector_costs::add_stmt_cost (int count, >>>>>>>>>>>>> vect_cost_for_stmt kind, >>>>>>>>>>>>> stmt_vec_info stmt_info, slp_tree, >>>>>>>>>>>>> tree vectype, int misalign, >>>>>>>>>>>>> vect_cost_model_location where) >>>>>>>>>>>>> { >>>>>>>>>>>>> [...] >>>>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead >>>>>>>>>>>>> of just looking at KIND. */ >>>>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>>>>>>>>>>> + if (stmt_info) >>>>>>>>>>>>> { >>>>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one >>>>>>>>>>>>> vec_to_scalar for each element. However, we can store the first >>>>>>>>>>>>> element using an FP store without a separate extract step. */ >>>>>>>>>>>>> if (vect_is_store_elt_extraction (kind, stmt_info)) >>>>>>>>>>>>> count -= 1; >>>>>>>>>>>>> >>>>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind, >>>>>>>>>>>>> stmt_info, >>>>>>>>>>>>> stmt_cost); >>>>>>>>>>>>> >>>>>>>>>>>>> if (vectype && m_vec_flags) >>>>>>>>>>>>> stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind, >>>>>>>>>>>>> stmt_info, vectype, >>>>>>>>>>>>> where, stmt_cost); >>>>>>>>>>>>> } >>>>>>>>>>>>> [...] >>>>>>>>>>>>> return record_stmt_cost (stmt_info, where, (count * >>>>>>>>>>>>> stmt_cost).ceil ()); >>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> Previously, for mtune=generic, this function returned a cost of 2 >>>>>>>>>>>>> for a vec_to_scalar operation in the vect body. Now "if >>>>>>>>>>>>> (stmt_info)" is entered and "if (vect_is_store_elt_extraction >>>>>>>>>>>>> (kind, stmt_info))" evaluates to true, which sets the count to 0 >>>>>>>>>>>>> and leads to a return value of 0. >>>>>>>>>>>> >>>>>>>>>>>> At the time the code was written, a scalarised store would be >>>>>>>>>>>> costed >>>>>>>>>>>> using one vec_to_scalar call into the backend, with the count >>>>>>>>>>>> parameter >>>>>>>>>>>> set to the number of elements being stored. The "count -= 1" was >>>>>>>>>>>> supposed to lop off the leading element extraction, since we can >>>>>>>>>>>> store >>>>>>>>>>>> lane 0 as a normal FP store. 
>>>>>>>>>>>> >>>>>>>>>>>> The target-independent costing was later reworked so that it costs >>>>>>>>>>>> each operation individually: >>>>>>>>>>>> >>>>>>>>>>>> for (i = 0; i < nstores; i++) >>>>>>>>>>>> { >>>>>>>>>>>> if (costing_p) >>>>>>>>>>>> { >>>>>>>>>>>> /* Only need vector extracting when there are more >>>>>>>>>>>> than one stores. */ >>>>>>>>>>>> if (nstores > 1) >>>>>>>>>>>> inside_cost >>>>>>>>>>>> += record_stmt_cost (cost_vec, 1, vec_to_scalar, >>>>>>>>>>>> stmt_info, 0, vect_body); >>>>>>>>>>>> /* Take a single lane vector type store as scalar >>>>>>>>>>>> store to avoid ICE like 110776. */ >>>>>>>>>>>> if (VECTOR_TYPE_P (ltype) >>>>>>>>>>>> && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >>>>>>>>>>>> n_adjacent_stores++; >>>>>>>>>>>> else >>>>>>>>>>>> inside_cost >>>>>>>>>>>> += record_stmt_cost (cost_vec, 1, scalar_store, >>>>>>>>>>>> stmt_info, 0, vect_body); >>>>>>>>>>>> continue; >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> Unfortunately, there's no easy way of telling whether a particular >>>>>>>>>>>> call >>>>>>>>>>>> is part of a group, and if so, which member of the group it is. >>>>>>>>>>>> >>>>>>>>>>>> I suppose we could give up on the attempt to be (somewhat) accurate >>>>>>>>>>>> and just disable the optimisation. Or we could restrict it to >>>>>>>>>>>> count > 1, >>>>>>>>>>>> since it might still be useful for gathers and scatters. >>>>>>>>>>> I tried restricting the calls to vect_is_store_elt_extraction to >>>>>>>>>>> count > 1 and it seems to resolve the issue of costing >>>>>>>>>>> vec_to_scalar operations with 0 (see patch below). >>>>>>>>>>> What are your thoughts on this? >>>>>>>>>> >>>>>>>>>> Why didn't you pursue instead moving the vec_to_scalar cost together >>>>>>>>>> with the n_adjacent_store handling? >>>>>>>>> When I continued working on this patch, we had already reached stage >>>>>>>>> 3 and I was hesitant to introduce changes to the middle-end that were >>>>>>>>> not previously covered by this patch. So I tried if the issue could >>>>>>>>> not be resolved by making a small change in the backend. >>>>>>>>> If you still advise to use the n_adjacent_store instead, I’m happy to >>>>>>>>> look into it again. >>>>>>>> >>>>>>>> If Richard's ok with adjusting vectorizable_store for GCC 15 (which it >>>>>>>> sounds like he is), then I agree that would be better. Otherwise we'd >>>>>>>> be creating technical debt to clean up for GCC 16. And it is a >>>>>>>> regression >>>>>>>> of sorts, so is stage 3 material from that POV. >>>>>>>> >>>>>>>> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a >>>>>>>> "let's clean this up next stage 1" thing, since we needed to add tuning >>>>>>>> for a new CPU late during the cycle. But of course, there were other >>>>>>>> priorities when stage 1 actually came around, so it never actually >>>>>>>> happened. Thanks again for being the one to sort this out.) >>>>>>> Thanks for your feedback. Then I will try to make it work in >>>>>>> vectorizable_store. >>>>>>> Best, >>>>>>> Jennifer >>>>>> Below is the updated patch with a suggestion for the changes in >>>>>> vectorizable_store. It resolves the issue with the vec_to_scalar >>>>>> operations that were individually costed with 0. >>>>>> We already tested it on aarch64, no regression, but we are still doing >>>>>> performance testing. >>>>>> Can you give some feedback in the meantime on the patch itself? 
>>>>>> Thanks, >>>>>> Jennifer >>>>>> >>>>>> >>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable >>>>>> and >>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the >>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the >>>>>> default. To that end, the function aarch64_use_new_vector_costs_p and >>>>>> its uses >>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as >>>>>> described in >>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, >>>>>> we adjusted vectorizable_store such that the variable n_adjacent_stores >>>>>> also covers vec_to_scalar operations. This way vec_to_scalar operations >>>>>> are not costed individually, but as a group. >>>>>> >>>>>> Two tests were adjusted due to changes in codegen. In both cases, the >>>>>> old code performed loop unrolling once, but the new code does not: >>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with >>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>>>> -moverride=tune=none): >>>>>> f_int64_t_32: >>>>>> cbz w3, .L92 >>>>>> mov x4, 0 >>>>>> uxtw x3, w3 >>>>>> + cntd x5 >>>>>> + whilelo p7.d, xzr, x3 >>>>>> + mov z29.s, w5 >>>>>> mov z31.s, w2 >>>>>> - whilelo p6.d, xzr, x3 >>>>>> - mov x2, x3 >>>>>> - index z30.s, #0, #1 >>>>>> - uqdecd x2 >>>>>> - ptrue p5.b, all >>>>>> - whilelo p7.d, xzr, x2 >>>>>> + index z30.d, #0, #1 >>>>>> + ptrue p6.b, all >>>>>> .p2align 3,,7 >>>>>> .L94: >>>>>> - ld1d z27.d, p7/z, [x0, #1, mul vl] >>>>>> - ld1d z28.d, p6/z, [x0] >>>>>> - movprfx z29, z31 >>>>>> - mul z29.s, p5/m, z29.s, z30.s >>>>>> - incw x4 >>>>>> - uunpklo z0.d, z29.s >>>>>> - uunpkhi z29.d, z29.s >>>>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] >>>>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] >>>>>> - add z25.d, z28.d, z25.d >>>>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3] >>>>>> + movprfx z28, z31 >>>>>> + mul z28.s, p6/m, z28.s, z30.s >>>>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] >>>>>> add z26.d, z27.d, z26.d >>>>>> - st1d z26.d, p7, [x0, #1, mul vl] >>>>>> - whilelo p7.d, x4, x2 >>>>>> - st1d z25.d, p6, [x0] >>>>>> - incw z30.s >>>>>> - incb x0, all, mul #2 >>>>>> - whilelo p6.d, x4, x3 >>>>>> + st1d z26.d, p7, [x0, x4, lsl 3] >>>>>> + add z30.s, z30.s, z29.s >>>>>> + incd x4 >>>>>> + whilelo p7.d, x4, x3 >>>>>> b.any .L94 >>>>>> .L92: >>>>>> ret >>>>>> >>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with >>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>>>> -moverride=tune=none): >>>>>> f_int64_t_32: >>>>>> cbz w3, .L84 >>>>>> - addvl x5, x1, #1 >>>>>> mov x4, 0 >>>>>> uxtw x3, w3 >>>>>> - mov z31.s, w2 >>>>>> + cntd x5 >>>>>> whilelo p7.d, xzr, x3 >>>>>> - mov x2, x3 >>>>>> - index z30.s, #0, #1 >>>>>> - uqdecd x2 >>>>>> - ptrue p5.b, all >>>>>> - whilelo p6.d, xzr, x2 >>>>>> + mov z29.s, w5 >>>>>> + mov z31.s, w2 >>>>>> + index z30.d, #0, #1 >>>>>> + ptrue p6.b, all >>>>>> .p2align 3,,7 >>>>>> .L86: >>>>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3] >>>>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3] >>>>>> - movprfx z29, z30 >>>>>> - mul z29.s, p5/m, z29.s, z31.s >>>>>> - add z28.d, z28.d, #1 >>>>>> - uunpklo z26.d, z29.s >>>>>> - st1d z28.d, p7, [x0, z26.d, lsl 3] >>>>>> - incw x4 >>>>>> - uunpkhi z29.d, z29.s >>>>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3] >>>>>> + movprfx z28, z30 >>>>>> + mul z28.s, p6/m, z28.s, z31.s >>>>>> add z27.d, z27.d, #1 >>>>>> - whilelo p6.d, x4, x2 >>>>>> - st1d z27.d, p7, [x0, z29.d, lsl 3] >>>>>> - incw z30.s >>>>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3] 
>>>>>> + incd x4 >>>>>> + add z30.s, z30.s, z29.s >>>>>> whilelo p7.d, x4, x3 >>>>>> b.any .L86 >>>>>> .L84: >>>>>> ret >>>>>> >>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no >>>>>> regression. >>>>>> OK for mainline? >>>>>> >>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> >>>>>> >>>>>> gcc/ >>>>>> * tree-vect-stmts.cc (vectorizable_store): Extend the use of >>>>>> n_adjacent_stores to also cover vec_to_scalar operations. >>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove >>>>>> use_new_vector_costs as tuning option. >>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): >>>>>> Remove. >>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of >>>>>> aarch64_use_new_vector_costs_p. >>>>>> (aarch64_vector_costs::finish_cost): Remove use of >>>>>> aarch64_use_new_vector_costs_p. >>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove >>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. >>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. >>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. >>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. >>>>>> >>>>>> gcc/testsuite/ >>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. >>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. >>>>>> --- >>>>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 -- >>>>>> gcc/config/aarch64/aarch64.cc | 20 +++---------- >>>>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - >>>>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - >>>>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 - >>>>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 - >>>>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 - >>>>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - >>>>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - >>>>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - >>>>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - >>>>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - >>>>>> .../aarch64/tuning_models/neoversev3ae.h | 1 - >>>>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- >>>>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- >>>>>> gcc/tree-vect-stmts.cc | 29 ++++++++++--------- >>>>>> 16 files changed, 22 insertions(+), 44 deletions(-) >>>>>> >>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>> index ffbff20e29c..1de633c739b 100644 >>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", >>>>>> CHEAP_SHIFT_EXTEND) >>>>>> >>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", >>>>>> CSE_SVE_VL_CONSTANTS) >>>>>> >>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", >>>>>> USE_NEW_VECTOR_COSTS) >>>>>> - >>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", >>>>>> MATCHED_VECTOR_THROUGHPUT) >>>>>> >>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", >>>>>> AVOID_CROSS_LOOP_FMA) >>>>>> diff --git 
a/gcc/config/aarch64/aarch64.cc >>>>>> b/gcc/config/aarch64/aarch64.cc >>>>>> index 77a2a6bfa3a..71fba9cc63b 100644 >>>>>> --- a/gcc/config/aarch64/aarch64.cc >>>>>> +++ b/gcc/config/aarch64/aarch64.cc >>>>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info >>>>>> *vinfo, bool costing_for_scalar) >>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar); >>>>>> } >>>>>> >>>>>> -/* Return true if the current CPU should use the new costs defined >>>>>> - in GCC 11. This should be removed for GCC 12 and above, with the >>>>>> - costs applying to all CPUs instead. */ >>>>>> -static bool >>>>>> -aarch64_use_new_vector_costs_p () >>>>>> -{ >>>>>> - return (aarch64_tune_params.extra_tuning_flags >>>>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); >>>>>> -} >>>>>> - >>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ >>>>>> static const simd_vec_cost * >>>>>> aarch64_simd_vec_costs (tree vectype) >>>>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>>>> vect_cost_for_stmt kind, >>>>>> >>>>>> /* Do one-time initialization based on the vinfo. */ >>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); >>>>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) >>>>>> + if (!m_analyzed_vinfo) >>>>>> { >>>>>> if (loop_vinfo) >>>>>> analyze_loop_vinfo (loop_vinfo); >>>>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>>>> vect_cost_for_stmt kind, >>>>>> >>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead >>>>>> of just looking at KIND. */ >>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>>>> + if (stmt_info) >>>>>> { >>>>>> /* If we scalarize a strided store, the vectorizer costs one >>>>>> vec_to_scalar for each element. However, we can store the first >>>>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>>>> vect_cost_for_stmt kind, >>>>>> else >>>>>> m_num_last_promote_demote = 0; >>>>>> >>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>>>> + if (stmt_info) >>>>>> { >>>>>> /* Account for any extra "embedded" costs that apply additively >>>>>> to the base cost calculated above. */ >>>>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const >>>>>> vector_costs *uncast_scalar_costs) >>>>>> >>>>>> auto *scalar_costs >>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); >>>>>> - if (loop_vinfo >>>>>> - && m_vec_flags >>>>>> - && aarch64_use_new_vector_costs_p ()) >>>>>> + if (loop_vinfo && m_vec_flags) >>>>>> { >>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >>>>>> m_costs[vect_body]); >>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>> index b2ff716157a..0a8eff69307 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. 
*/ >>>>>> &generic_prefetch_tune, >>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>> index 2d704ecd110..a564528f43d 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings >>>>>> = >>>>>> 0, /* max_case_values. */ >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>> index bdd309ab03d..f090d5cde50 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>> @@ -183,7 +183,6 @@ static const struct tune_params >>>>>> generic_armv8_a_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>> index a05a9ab92a2..4c33c147444 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>> @@ -249,7 +249,6 @@ static const struct tune_params >>>>>> generic_armv9_a_tunings = >>>>>> 0, /* max_case_values. */ >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>> &generic_armv9a_prefetch_tune, >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>> index c407b89a22f..fe4f7c10f73 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>> @@ -156,7 +156,6 @@ static const struct tune_params >>>>>> neoverse512tvb_tunings = >>>>>> 0, /* max_case_values. */ >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>> index fd5f8f37370..0c74068da2c 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>> index 8b156c2fe4d..9d4e1be171a 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>> index 23c121d8652..85a78bb2bef 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>> index 40af5f47f4f..1dd452beb8d 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW >>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>> index d65d74bfecf..d0ba5b1aef6 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. 
*/ >>>>>> &generic_prefetch_tune, >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>> index 7b7fa0b4b08..a1572048503 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings >>>>>> = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>> index 762805ff54b..c334b7a6875 100644 >>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>> @@ -15,4 +15,4 @@ >>>>>> so we vectorize the offset calculation. This means that the >>>>>> 64-bit version needs two copies. */ >>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, >>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>> index f0ea58e38e2..94cc63049bc 100644 >>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>> @@ -15,4 +15,4 @@ >>>>>> so we vectorize the offset calculation. This means that the >>>>>> 64-bit version needs two copies. */ >>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], >>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc >>>>>> index be1139a423c..6d7d28c4702 100644 >>>>>> --- a/gcc/tree-vect-stmts.cc >>>>>> +++ b/gcc/tree-vect-stmts.cc >>>>>> @@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo, >>>>>> { >>>>>> if (costing_p) >>>>>> { >>>>>> - /* Only need vector extracting when there are more >>>>>> - than one stores. */ >>>>>> - if (nstores > 1) >>>>>> - inside_cost >>>>>> - += record_stmt_cost (cost_vec, 1, >>>>>> vec_to_scalar, >>>>>> - stmt_info, slp_node, >>>>>> - 0, vect_body); >>>>>> /* Take a single lane vector type store as scalar >>>>>> store to avoid ICE like 110776. */ >>>>>> - if (VECTOR_TYPE_P (ltype) >>>>>> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >>>>>> + bool single_lane_vec_p = >>>>>> + VECTOR_TYPE_P (ltype) >>>>>> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U); >>>>>> + /* Only need vector extracting when there are more >>>>>> + than one stores. 
*/ >>>>>> + if (nstores > 1 || single_lane_vec_p) >>>>>> n_adjacent_stores++; >>>>>> - else >>>>>> + if (!single_lane_vec_p) >>>>> >>>>> I think it's somewhat non-obvious that nstores > 1 and single_lane_vec_p >>>>> correlate. In fact I think that we always record a store, just for >>>>> single-element >>>>> vectors we record scalar stores. I suggest to here always to just >>>>> n_adjacent_stores++ >>>>> and below ... >>>>> >>>>>> inside_cost >>>>>> += record_stmt_cost (cost_vec, 1, scalar_store, >>>>>> stmt_info, 0, vect_body); >>>>>> @@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo, >>>>>> if (costing_p) >>>>>> { >>>>>> if (n_adjacent_stores > 0) >>>>>> - vect_get_store_cost (vinfo, stmt_info, slp_node, >>>>>> n_adjacent_stores, >>>>>> - alignment_support_scheme, misalignment, >>>>>> - &inside_cost, cost_vec); >>>>>> + { >>>>>> + vect_get_store_cost (vinfo, stmt_info, slp_node, >>>>>> n_adjacent_stores, >>>>>> + alignment_support_scheme, >>>>>> misalignment, >>>>>> + &inside_cost, cost_vec); >>>>> >>>>> ... record n_adjacent_stores scalar_store when ltype is single-lane and >>>>> record >>>>> n_adjacent_stores vect_to_scalar if nstores > 1 (and else none). >>>>> >>>>> Richard. >>>> Thanks for the feedback, I’m glad it’s going in the right direction. Below >>>> is the updated patch, re-validated on aarch64. >>>> Thanks, Jennifer >>>> >>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and >>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the >>>> default. To that end, the function aarch64_use_new_vector_costs_p and its >>>> uses >>>> were removed. To prevent costing vec_to_scalar operations with 0, as >>>> described in >>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, >>>> we adjusted vectorizable_store such that the variable n_adjacent_stores >>>> also covers vec_to_scalar operations. This way vec_to_scalar operations >>>> are not costed individually, but as a group. >>>> >>>> Two tests were adjusted due to changes in codegen. 
In both cases, the >>>> old code performed loop unrolling once, but the new code does not: >>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>> -moverride=tune=none): >>>> f_int64_t_32: >>>> cbz w3, .L92 >>>> mov x4, 0 >>>> uxtw x3, w3 >>>> + cntd x5 >>>> + whilelo p7.d, xzr, x3 >>>> + mov z29.s, w5 >>>> mov z31.s, w2 >>>> - whilelo p6.d, xzr, x3 >>>> - mov x2, x3 >>>> - index z30.s, #0, #1 >>>> - uqdecd x2 >>>> - ptrue p5.b, all >>>> - whilelo p7.d, xzr, x2 >>>> + index z30.d, #0, #1 >>>> + ptrue p6.b, all >>>> .p2align 3,,7 >>>> .L94: >>>> - ld1d z27.d, p7/z, [x0, #1, mul vl] >>>> - ld1d z28.d, p6/z, [x0] >>>> - movprfx z29, z31 >>>> - mul z29.s, p5/m, z29.s, z30.s >>>> - incw x4 >>>> - uunpklo z0.d, z29.s >>>> - uunpkhi z29.d, z29.s >>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] >>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] >>>> - add z25.d, z28.d, z25.d >>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3] >>>> + movprfx z28, z31 >>>> + mul z28.s, p6/m, z28.s, z30.s >>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] >>>> add z26.d, z27.d, z26.d >>>> - st1d z26.d, p7, [x0, #1, mul vl] >>>> - whilelo p7.d, x4, x2 >>>> - st1d z25.d, p6, [x0] >>>> - incw z30.s >>>> - incb x0, all, mul #2 >>>> - whilelo p6.d, x4, x3 >>>> + st1d z26.d, p7, [x0, x4, lsl 3] >>>> + add z30.s, z30.s, z29.s >>>> + incd x4 >>>> + whilelo p7.d, x4, x3 >>>> b.any .L94 >>>> .L92: >>>> ret >>>> >>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>> -moverride=tune=none): >>>> f_int64_t_32: >>>> cbz w3, .L84 >>>> - addvl x5, x1, #1 >>>> mov x4, 0 >>>> uxtw x3, w3 >>>> - mov z31.s, w2 >>>> + cntd x5 >>>> whilelo p7.d, xzr, x3 >>>> - mov x2, x3 >>>> - index z30.s, #0, #1 >>>> - uqdecd x2 >>>> - ptrue p5.b, all >>>> - whilelo p6.d, xzr, x2 >>>> + mov z29.s, w5 >>>> + mov z31.s, w2 >>>> + index z30.d, #0, #1 >>>> + ptrue p6.b, all >>>> .p2align 3,,7 >>>> .L86: >>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3] >>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3] >>>> - movprfx z29, z30 >>>> - mul z29.s, p5/m, z29.s, z31.s >>>> - add z28.d, z28.d, #1 >>>> - uunpklo z26.d, z29.s >>>> - st1d z28.d, p7, [x0, z26.d, lsl 3] >>>> - incw x4 >>>> - uunpkhi z29.d, z29.s >>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3] >>>> + movprfx z28, z30 >>>> + mul z28.s, p6/m, z28.s, z31.s >>>> add z27.d, z27.d, #1 >>>> - whilelo p6.d, x4, x2 >>>> - st1d z27.d, p7, [x0, z29.d, lsl 3] >>>> - incw z30.s >>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3] >>>> + incd x4 >>>> + add z30.s, z30.s, z29.s >>>> whilelo p7.d, x4, x3 >>>> b.any .L86 >>>> .L84: >>>> ret >>>> >>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no >>>> regression. >>>> OK for mainline? >>>> >>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> >>>> >>>> gcc/ >>>> * tree-vect-stmts.cc (vectorizable_store): Extend the use of >>>> n_adjacent_stores to also cover vec_to_scalar operations. >>>> * config/aarch64/aarch64-tuning-flags.def: Remove >>>> use_new_vector_costs as tuning option. >>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): >>>> Remove. >>>> (aarch64_vector_costs::add_stmt_cost): Remove use of >>>> aarch64_use_new_vector_costs_p. >>>> (aarch64_vector_costs::finish_cost): Remove use of >>>> aarch64_use_new_vector_costs_p. >>>> * config/aarch64/tuning_models/cortexx925.h: Remove >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. >>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. 
>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. >>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. >>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. >>>> * config/aarch64/tuning_models/neoversen2.h: Likewise. >>>> * config/aarch64/tuning_models/neoversen3.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev1.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev2.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev3.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. >>>> >>>> gcc/testsuite/ >>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. >>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. >>>> --- >>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 - >>>> gcc/config/aarch64/aarch64.cc | 20 ++-------- >>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - >>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - >>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 - >>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 - >>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - >>>> .../aarch64/tuning_models/neoversev3ae.h | 1 - >>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- >>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- >>>> gcc/tree-vect-stmts.cc | 37 +++++++++++-------- >>>> 16 files changed, 27 insertions(+), 47 deletions(-) >>>> >>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def >>>> b/gcc/config/aarch64/aarch64-tuning-flags.def >>>> index ffbff20e29c..1de633c739b 100644 >>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def >>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def >>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", >>>> CHEAP_SHIFT_EXTEND) >>>> >>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS) >>>> >>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS) >>>> - >>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", >>>> MATCHED_VECTOR_THROUGHPUT) >>>> >>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA) >>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc >>>> index 77a2a6bfa3a..71fba9cc63b 100644 >>>> --- a/gcc/config/aarch64/aarch64.cc >>>> +++ b/gcc/config/aarch64/aarch64.cc >>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, >>>> bool costing_for_scalar) >>>> return new aarch64_vector_costs (vinfo, costing_for_scalar); >>>> } >>>> >>>> -/* Return true if the current CPU should use the new costs defined >>>> - in GCC 11. This should be removed for GCC 12 and above, with the >>>> - costs applying to all CPUs instead. */ >>>> -static bool >>>> -aarch64_use_new_vector_costs_p () >>>> -{ >>>> - return (aarch64_tune_params.extra_tuning_flags >>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); >>>> -} >>>> - >>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ >>>> static const simd_vec_cost * >>>> aarch64_simd_vec_costs (tree vectype) >>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>> vect_cost_for_stmt kind, >>>> >>>> /* Do one-time initialization based on the vinfo. 
*/ >>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); >>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) >>>> + if (!m_analyzed_vinfo) >>>> { >>>> if (loop_vinfo) >>>> analyze_loop_vinfo (loop_vinfo); >>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>> vect_cost_for_stmt kind, >>>> >>>> /* Try to get a more accurate cost by looking at STMT_INFO instead >>>> of just looking at KIND. */ >>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>> + if (stmt_info) >>>> { >>>> /* If we scalarize a strided store, the vectorizer costs one >>>> vec_to_scalar for each element. However, we can store the first >>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>> vect_cost_for_stmt kind, >>>> else >>>> m_num_last_promote_demote = 0; >>>> >>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>> + if (stmt_info) >>>> { >>>> /* Account for any extra "embedded" costs that apply additively >>>> to the base cost calculated above. */ >>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const >>>> vector_costs *uncast_scalar_costs) >>>> >>>> auto *scalar_costs >>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); >>>> - if (loop_vinfo >>>> - && m_vec_flags >>>> - && aarch64_use_new_vector_costs_p ()) >>>> + if (loop_vinfo && m_vec_flags) >>>> { >>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >>>> m_costs[vect_body]); >>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h >>>> b/gcc/config/aarch64/tuning_models/cortexx925.h >>>> index 5ebaf66e986..74772f3e15f 100644 >>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h >>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h >>>> @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> index 2d704ecd110..a564528f43d 100644 >>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> index bdd309ab03d..f090d5cde50 100644 >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> @@ -183,7 +183,6 @@ static const struct tune_params >>>> generic_armv8_a_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. 
*/ >>>> &generic_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> index 785e00946bc..7b5821183bc 100644 >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> @@ -251,7 +251,6 @@ static const struct tune_params >>>> generic_armv9_a_tunings = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> index 007f987154c..f7457df59e5 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings >>>> = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h >>>> b/gcc/config/aarch64/tuning_models/neoversen2.h >>>> index 32560d2f5f8..541b61c8179 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h >>>> b/gcc/config/aarch64/tuning_models/neoversen3.h >>>> index 2010bc4645b..eff668132a8 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h >>>> b/gcc/config/aarch64/tuning_models/neoversev1.h >>>> index c3751e32696..d11472b6e1e 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h >>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h >>>> b/gcc/config/aarch64/tuning_models/neoversev2.h >>>> index 80dbe5c806c..ee77ffdd3bc 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW >>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h >>>> b/gcc/config/aarch64/tuning_models/neoversev3.h >>>> index efe09e16d1e..6ef143ef7d5 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> index 66849f30889..96bdbf971f1 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> index 762805ff54b..c334b7a6875 100644 >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> @@ -15,4 +15,4 @@ >>>> so we vectorize the offset calculation. This means that the >>>> 64-bit version needs two copies. 
*/ >>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> index f0ea58e38e2..94cc63049bc 100644 >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> @@ -15,4 +15,4 @@ >>>> so we vectorize the offset calculation. This means that the >>>> 64-bit version needs two copies. */ >>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc >>>> index be1139a423c..ab57163c243 100644 >>>> --- a/gcc/tree-vect-stmts.cc >>>> +++ b/gcc/tree-vect-stmts.cc >>>> @@ -8834,19 +8834,8 @@ vectorizable_store (vec_info *vinfo, >>>> { >>>> if (costing_p) >>>> { >>>> - /* Only need vector extracting when there are more >>>> - than one stores. */ >>>> - if (nstores > 1) >>>> - inside_cost >>>> - += record_stmt_cost (cost_vec, 1, vec_to_scalar, >>>> - stmt_info, slp_node, >>>> - 0, vect_body); >>>> - /* Take a single lane vector type store as scalar >>>> - store to avoid ICE like 110776. */ >>>> - if (VECTOR_TYPE_P (ltype) >>>> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >>>> - n_adjacent_stores++; >>>> - else >>>> + n_adjacent_stores++; >>>> + if (!VECTOR_TYPE_P (ltype)) >>> >>> This should be combined with the single-lane vector case below. >>> >>>> inside_cost >>>> += record_stmt_cost (cost_vec, 1, scalar_store, >>>> stmt_info, 0, vect_body); >>>> @@ -8905,9 +8894,25 @@ vectorizable_store (vec_info *vinfo, >>>> if (costing_p) >>>> { >>>> if (n_adjacent_stores > 0) >>>> - vect_get_store_cost (vinfo, stmt_info, slp_node, >>>> n_adjacent_stores, >>>> - alignment_support_scheme, misalignment, >>>> - &inside_cost, cost_vec); >>>> + { >>>> + /* Take a single lane vector type store as scalar >>>> + store to avoid ICE like 110776. */ >>>> + if (VECTOR_TYPE_P (ltype) >>>> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >>>> + inside_cost >>>> + += record_stmt_cost (cost_vec, n_adjacent_stores, >>>> + scalar_store, stmt_info, 0, vect_body); >>>> + /* Only need vector extracting when there are more >>>> + than one stores. */ >>>> + if (nstores > 1) >>>> + inside_cost >>>> + += record_stmt_cost (cost_vec, n_adjacent_stores, >>>> + vec_to_scalar, stmt_info, slp_node, >>>> + 0, vect_body); >>>> + vect_get_store_cost (vinfo, stmt_info, slp_node, >>> >>> This should only be done for multi-lane vectors. >> Thanks for the quick reply. As I am making the changes, I am wondering: Do >> we even need n_adjacent_stores anymore? It appears to always have the same >> value as nstores. Can we remove it and use nstores instead or does it still >> serve another purpose? > It was a heuristic needed for powerpc(?), can you confirm we're not combining > stores from VF unrolling for strided SLP stores?
Hi Richard,
the reasoning behind my suggestion to replace n_adjacent_stores with nstores in this code section is that with my patch they will logically always have the same value.
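A minimal standalone sketch of that point (a toy model with invented names, not GCC code): it checks that the grouped counter always ends up equal to nstores, and it shows that grouping replaces several backend cost callbacks of count 1 with a single callback carrying the whole count, which is what a count-based adjustment such as the aarch64 store-element-extraction heuristic needs to see.

#include <cassert>
#include <cstdio>

/* Stand-in for the vectorizer's cost hook; it only tallies what it is given.  */
struct toy_backend
{
  unsigned calls = 0;          /* number of cost callbacks received */
  unsigned vec_to_scalar = 0;  /* total vec_to_scalar count recorded */

  void record_vec_to_scalar (unsigned count)
  {
    calls++;
    vec_to_scalar += count;
  }
};

/* Old scheme: one callback per scalarised element.  */
static toy_backend
cost_per_element (unsigned nstores)
{
  toy_backend be;
  for (unsigned i = 0; i < nstores; i++)
    if (nstores > 1)
      be.record_vec_to_scalar (1);
  return be;
}

/* New scheme: count the group inside the loop, record it once afterwards.  */
static toy_backend
cost_grouped (unsigned nstores)
{
  toy_backend be;
  unsigned n_adjacent_stores = 0;
  for (unsigned i = 0; i < nstores; i++)
    n_adjacent_stores++;
  assert (n_adjacent_stores == nstores);  /* the invariant mentioned above */
  if (n_adjacent_stores > 0 && nstores > 1)
    be.record_vec_to_scalar (n_adjacent_stores);
  return be;
}

int
main ()
{
  for (unsigned n = 1; n <= 8; n++)
    {
      toy_backend a = cost_per_element (n);
      toy_backend b = cost_grouped (n);
      /* Same total is recorded either way, but now in a single call.  */
      assert (a.vec_to_scalar == b.vec_to_scalar);
      printf ("nstores=%u: %u call(s) -> %u call(s)\n", n, a.calls, b.calls);
    }
  return 0;
}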
Having said that, I looked into why n_adjacent_stores was introduced in the first place: the patch [1] that introduced n_adjacent_stores fixed a regression on aarch64 by costing vector loads/stores together. The variables n_adjacent_stores and n_adjacent_loads were each added in two code sections in vectorizable_store and vectorizable_load. The connection to PowerPC you recalled is also mentioned in the PR, but I believe it refers to the enum dr_alignment_support alignment_support_scheme that is used in vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores, alignment_support_scheme, misalignment, &inside_cost, cost_vec); to that call I made no changes other than refactoring the if-statement around it. So, given that n_adjacent_stores was introduced in multiple locations, I would rather leave it in the code section I changed, to keep vectorizable_store and vectorizable_load consistent.

Regarding your question about not combining stores from VF unrolling for strided SLP stores: I'm not entirely sure what you mean, but could it be covered by the tests gcc.target/aarch64/ldp_stp_* that were also mentioned in [1]?

I added the changes you proposed in the updated patch below, but kept n_adjacent_stores. The patch was re-validated on aarch64.
Thanks,
Jennifer

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111784#c3

This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and use_new_vector_costs entry in aarch64-tuning-flags.def and makes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the default. To that end, the function aarch64_use_new_vector_costs_p and its uses were removed. To prevent costing vec_to_scalar operations with 0, as described in https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, we adjusted vectorizable_store such that the variable n_adjacent_stores also covers vec_to_scalar operations. This way vec_to_scalar operations are not costed individually, but as a group.

Two tests were adjusted due to changes in codegen.
In both cases, the old code performed loop unrolling once, but the new code does not: Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none): f_int64_t_32: cbz w3, .L92 mov x4, 0 uxtw x3, w3 + cntd x5 + whilelo p7.d, xzr, x3 + mov z29.s, w5 mov z31.s, w2 - whilelo p6.d, xzr, x3 - mov x2, x3 - index z30.s, #0, #1 - uqdecd x2 - ptrue p5.b, all - whilelo p7.d, xzr, x2 + index z30.d, #0, #1 + ptrue p6.b, all .p2align 3,,7 .L94: - ld1d z27.d, p7/z, [x0, #1, mul vl] - ld1d z28.d, p6/z, [x0] - movprfx z29, z31 - mul z29.s, p5/m, z29.s, z30.s - incw x4 - uunpklo z0.d, z29.s - uunpkhi z29.d, z29.s - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] - add z25.d, z28.d, z25.d + ld1d z27.d, p7/z, [x0, x4, lsl 3] + movprfx z28, z31 + mul z28.s, p6/m, z28.s, z30.s + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] add z26.d, z27.d, z26.d - st1d z26.d, p7, [x0, #1, mul vl] - whilelo p7.d, x4, x2 - st1d z25.d, p6, [x0] - incw z30.s - incb x0, all, mul #2 - whilelo p6.d, x4, x3 + st1d z26.d, p7, [x0, x4, lsl 3] + add z30.s, z30.s, z29.s + incd x4 + whilelo p7.d, x4, x3 b.any .L94 .L92: ret Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none): f_int64_t_32: cbz w3, .L84 - addvl x5, x1, #1 mov x4, 0 uxtw x3, w3 - mov z31.s, w2 + cntd x5 whilelo p7.d, xzr, x3 - mov x2, x3 - index z30.s, #0, #1 - uqdecd x2 - ptrue p5.b, all - whilelo p6.d, xzr, x2 + mov z29.s, w5 + mov z31.s, w2 + index z30.d, #0, #1 + ptrue p6.b, all .p2align 3,,7 .L86: - ld1d z28.d, p7/z, [x1, x4, lsl 3] - ld1d z27.d, p6/z, [x5, x4, lsl 3] - movprfx z29, z30 - mul z29.s, p5/m, z29.s, z31.s - add z28.d, z28.d, #1 - uunpklo z26.d, z29.s - st1d z28.d, p7, [x0, z26.d, lsl 3] - incw x4 - uunpkhi z29.d, z29.s + ld1d z27.d, p7/z, [x1, x4, lsl 3] + movprfx z28, z30 + mul z28.s, p6/m, z28.s, z31.s add z27.d, z27.d, #1 - whilelo p6.d, x4, x2 - st1d z27.d, p7, [x0, z29.d, lsl 3] - incw z30.s + st1d z27.d, p7, [x0, z28.d, uxtw 3] + incd x4 + add z30.s, z30.s, z29.s whilelo p7.d, x4, x3 b.any .L86 .L84: ret The patch was bootstrapped and tested on aarch64-linux-gnu, no regression. OK for mainline? Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> gcc/ * tree-vect-stmts.cc (vectorizable_store): Extend the use of n_adjacent_stores to also cover vec_to_scalar operations. * config/aarch64/aarch64-tuning-flags.def: Remove use_new_vector_costs as tuning option. * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): Remove. (aarch64_vector_costs::add_stmt_cost): Remove use of aarch64_use_new_vector_costs_p. (aarch64_vector_costs::finish_cost): Remove use of aarch64_use_new_vector_costs_p. * config/aarch64/tuning_models/cortexx925.h: Remove AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. * config/aarch64/tuning_models/neoversen2.h: Likewise. * config/aarch64/tuning_models/neoversen3.h: Likewise. * config/aarch64/tuning_models/neoversev1.h: Likewise. * config/aarch64/tuning_models/neoversev2.h: Likewise. * config/aarch64/tuning_models/neoversev3.h: Likewise. * config/aarch64/tuning_models/neoversev3ae.h: Likewise. gcc/testsuite/ * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. 
	* gcc.target/aarch64/sve/strided_store_2.c: Likewise.
---
 gcc/config/aarch64/aarch64-tuning-flags.def   |  2 -
 gcc/config/aarch64/aarch64.cc                 | 20 ++--------
 gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
 .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
 .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
 .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
 .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
 gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
 gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
 gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
 gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
 gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
 .../aarch64/tuning_models/neoversev3ae.h      |  1 -
 .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
 .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
 gcc/tree-vect-stmts.cc                        | 40 ++++++++++---------
 16 files changed, 27 insertions(+), 50 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index ffbff20e29c..1de633c739b 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
 
 AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
 
-AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
-
 AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", MATCHED_VECTOR_THROUGHPUT)
 
 AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 77a2a6bfa3a..71fba9cc63b 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar)
   return new aarch64_vector_costs (vinfo, costing_for_scalar);
 }
 
-/* Return true if the current CPU should use the new costs defined
-   in GCC 11.  This should be removed for GCC 12 and above, with the
-   costs applying to all CPUs instead.  */
-static bool
-aarch64_use_new_vector_costs_p ()
-{
-  return (aarch64_tune_params.extra_tuning_flags
-	  & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
-}
-
 /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
 static const simd_vec_cost *
 aarch64_simd_vec_costs (tree vectype)
@@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 
   /* Do one-time initialization based on the vinfo.  */
   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
-  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
+  if (!m_analyzed_vinfo)
     {
       if (loop_vinfo)
	analyze_loop_vinfo (loop_vinfo);
@@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 
   /* Try to get a more accurate cost by looking at STMT_INFO instead
      of just looking at KIND.  */
-  if (stmt_info && aarch64_use_new_vector_costs_p ())
+  if (stmt_info)
     {
       /* If we scalarize a strided store, the vectorizer costs one
	 vec_to_scalar for each element.  However, we can store the first
@@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
   else
     m_num_last_promote_demote = 0;
 
-  if (stmt_info && aarch64_use_new_vector_costs_p ())
+  if (stmt_info)
     {
       /* Account for any extra "embedded" costs that apply additively
	 to the base cost calculated above.  */
@@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
 
   auto *scalar_costs
     = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
-  if (loop_vinfo
-      && m_vec_flags
-      && aarch64_use_new_vector_costs_p ())
+  if (loop_vinfo && m_vec_flags)
     {
       m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
					     m_costs[vect_body]);
diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h
index 5ebaf66e986..74772f3e15f 100644
--- a/gcc/config/aarch64/tuning_models/cortexx925.h
+++ b/gcc/config/aarch64/tuning_models/cortexx925.h
@@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings =
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),	/* tune_flags.  */
   &generic_armv9a_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
index 2d704ecd110..a564528f43d 100644
--- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
+++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
@@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
   &generic_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
index bdd309ab03d..f090d5cde50 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
@@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings =
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
   &generic_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
index 785e00946bc..7b5821183bc 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
@@ -251,7 +251,6 @@ static const struct tune_params generic_armv9_a_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
   &generic_armv9a_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
index 007f987154c..f7457df59e5 100644
--- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
+++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
@@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
   &generic_armv9a_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h
index 32560d2f5f8..541b61c8179 100644
--- a/gcc/config/aarch64/tuning_models/neoversen2.h
+++ b/gcc/config/aarch64/tuning_models/neoversen2.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),	/* tune_flags.  */
   &generic_armv9a_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h b/gcc/config/aarch64/tuning_models/neoversen3.h
index 2010bc4645b..eff668132a8 100644
--- a/gcc/config/aarch64/tuning_models/neoversen3.h
+++ b/gcc/config/aarch64/tuning_models/neoversen3.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
   &generic_armv9a_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h b/gcc/config/aarch64/tuning_models/neoversev1.h
index c3751e32696..d11472b6e1e 100644
--- a/gcc/config/aarch64/tuning_models/neoversev1.h
+++ b/gcc/config/aarch64/tuning_models/neoversev1.h
@@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),	/* tune_flags.  */
   &generic_armv9a_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h b/gcc/config/aarch64/tuning_models/neoversev2.h
index 80dbe5c806c..ee77ffdd3bc 100644
--- a/gcc/config/aarch64/tuning_models/neoversev2.h
+++ b/gcc/config/aarch64/tuning_models/neoversev2.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings =
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
    | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),	/* tune_flags.  */
diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h b/gcc/config/aarch64/tuning_models/neoversev3.h
index efe09e16d1e..6ef143ef7d5 100644
--- a/gcc/config/aarch64/tuning_models/neoversev3.h
+++ b/gcc/config/aarch64/tuning_models/neoversev3.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),	/* tune_flags.  */
   &generic_armv9a_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h b/gcc/config/aarch64/tuning_models/neoversev3ae.h
index 66849f30889..96bdbf971f1 100644
--- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
+++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),	/* tune_flags.  */
   &generic_armv9a_prefetch_tune,
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
index 762805ff54b..c334b7a6875 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
@@ -15,4 +15,4 @@
    so we vectorize the offset calculation.  This means that the
    64-bit version needs two copies.  */
 /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
-/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
index f0ea58e38e2..94cc63049bc 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
@@ -15,4 +15,4 @@
    so we vectorize the offset calculation.  This means that the
    64-bit version needs two copies.  */
 /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
-/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index be1139a423c..a14248193ca 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8834,22 +8834,7 @@ vectorizable_store (vec_info *vinfo,
	    {
	      if (costing_p)
		{
-		  /* Only need vector extracting when there are more
-		     than one stores.  */
-		  if (nstores > 1)
-		    inside_cost
-		      += record_stmt_cost (cost_vec, 1, vec_to_scalar,
-					   stmt_info, slp_node,
-					   0, vect_body);
-		  /* Take a single lane vector type store as scalar
-		     store to avoid ICE like 110776.  */
-		  if (VECTOR_TYPE_P (ltype)
-		      && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
-		    n_adjacent_stores++;
-		  else
-		    inside_cost
-		      += record_stmt_cost (cost_vec, 1, scalar_store,
-					   stmt_info, 0, vect_body);
+		  n_adjacent_stores++;
		  continue;
		}
	      tree newref, newoff;
@@ -8905,9 +8890,26 @@ vectorizable_store (vec_info *vinfo,
       if (costing_p)
	{
	  if (n_adjacent_stores > 0)
-	    vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
-				 alignment_support_scheme, misalignment,
-				 &inside_cost, cost_vec);
+	    {
+	      /* Take a single lane vector type store as scalar
+		 store to avoid ICE like 110776.  */
+	      if (VECTOR_TYPE_P (ltype)
+		  && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
+		vect_get_store_cost (vinfo, stmt_info, slp_node,
+				     n_adjacent_stores, alignment_support_scheme,
+				     misalignment, &inside_cost, cost_vec);
+	      else
+		inside_cost
+		  += record_stmt_cost (cost_vec, n_adjacent_stores,
+				       scalar_store, stmt_info, 0, vect_body);
+	      /* Only need vector extracting when there are more
+		 than one stores.  */
+	      if (nstores > 1)
+		inside_cost
+		  += record_stmt_cost (cost_vec, n_adjacent_stores,
+				       vec_to_scalar, stmt_info, slp_node,
+				       0, vect_body);
+	    }
	  if (dump_enabled_p ())
	    dump_printf_loc (MSG_NOTE, vect_location,
			     "vect_model_store_cost: inside_cost = %d, "
-- 
2.44.0

> >> Thanks, Jennifer >>> >>>> + n_adjacent_stores, alignment_support_scheme, >>>> + misalignment, &inside_cost, cost_vec); >>>> + } >>>> if (dump_enabled_p ()) >>>> dump_printf_loc (MSG_NOTE, vect_location, >>>> "vect_model_store_cost: inside_cost = %d, " >>>> -- >>>> 2.34.1 >>>>> >>>>>> + inside_cost >>>>>> + += record_stmt_cost (cost_vec, n_adjacent_stores, >>>>>> vec_to_scalar, >>>>>> + stmt_info, slp_node, >>>>>> + 0, vect_body); >>>>>> + } >>>>>> if (dump_enabled_p ()) >>>>>> dump_printf_loc (MSG_NOTE, vect_location, >>>>>> "vect_model_store_cost: inside_cost = %d, " >>>>>> -- >>>>>> 2.44.0 >>>>>> >>>>>> >>>>>>>> >>>>>>>> Richard >>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Jennifer >>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Jennifer >>>>>>>>>>> >>>>>>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>>>> tunable and >>>>>>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the >>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the >>>>>>>>>>> default. To that end, the function aarch64_use_new_vector_costs_p >>>>>>>>>>> and its uses >>>>>>>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as >>>>>>>>>>> described in >>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, >>>>>>>>>>> we guarded the call to vect_is_store_elt_extraction in >>>>>>>>>>> aarch64_vector_costs::add_stmt_cost by count > 1. >>>>>>>>>>> >>>>>>>>>>> Two tests were adjusted due to changes in codegen.
In both cases, >>>>>>>>>>> the >>>>>>>>>>> old code performed loop unrolling once, but the new code does not: >>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with >>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>>>>>>>>> -moverride=tune=none): >>>>>>>>>>> f_int64_t_32: >>>>>>>>>>> cbz w3, .L92 >>>>>>>>>>> mov x4, 0 >>>>>>>>>>> uxtw x3, w3 >>>>>>>>>>> + cntd x5 >>>>>>>>>>> + whilelo p7.d, xzr, x3 >>>>>>>>>>> + mov z29.s, w5 >>>>>>>>>>> mov z31.s, w2 >>>>>>>>>>> - whilelo p6.d, xzr, x3 >>>>>>>>>>> - mov x2, x3 >>>>>>>>>>> - index z30.s, #0, #1 >>>>>>>>>>> - uqdecd x2 >>>>>>>>>>> - ptrue p5.b, all >>>>>>>>>>> - whilelo p7.d, xzr, x2 >>>>>>>>>>> + index z30.d, #0, #1 >>>>>>>>>>> + ptrue p6.b, all >>>>>>>>>>> .p2align 3,,7 >>>>>>>>>>> .L94: >>>>>>>>>>> - ld1d z27.d, p7/z, [x0, #1, mul vl] >>>>>>>>>>> - ld1d z28.d, p6/z, [x0] >>>>>>>>>>> - movprfx z29, z31 >>>>>>>>>>> - mul z29.s, p5/m, z29.s, z30.s >>>>>>>>>>> - incw x4 >>>>>>>>>>> - uunpklo z0.d, z29.s >>>>>>>>>>> - uunpkhi z29.d, z29.s >>>>>>>>>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] >>>>>>>>>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] >>>>>>>>>>> - add z25.d, z28.d, z25.d >>>>>>>>>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3] >>>>>>>>>>> + movprfx z28, z31 >>>>>>>>>>> + mul z28.s, p6/m, z28.s, z30.s >>>>>>>>>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] >>>>>>>>>>> add z26.d, z27.d, z26.d >>>>>>>>>>> - st1d z26.d, p7, [x0, #1, mul vl] >>>>>>>>>>> - whilelo p7.d, x4, x2 >>>>>>>>>>> - st1d z25.d, p6, [x0] >>>>>>>>>>> - incw z30.s >>>>>>>>>>> - incb x0, all, mul #2 >>>>>>>>>>> - whilelo p6.d, x4, x3 >>>>>>>>>>> + st1d z26.d, p7, [x0, x4, lsl 3] >>>>>>>>>>> + add z30.s, z30.s, z29.s >>>>>>>>>>> + incd x4 >>>>>>>>>>> + whilelo p7.d, x4, x3 >>>>>>>>>>> b.any .L94 >>>>>>>>>>> .L92: >>>>>>>>>>> ret >>>>>>>>>>> >>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with >>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>>>>>>>>> -moverride=tune=none): >>>>>>>>>>> f_int64_t_32: >>>>>>>>>>> cbz w3, .L84 >>>>>>>>>>> - addvl x5, x1, #1 >>>>>>>>>>> mov x4, 0 >>>>>>>>>>> uxtw x3, w3 >>>>>>>>>>> - mov z31.s, w2 >>>>>>>>>>> + cntd x5 >>>>>>>>>>> whilelo p7.d, xzr, x3 >>>>>>>>>>> - mov x2, x3 >>>>>>>>>>> - index z30.s, #0, #1 >>>>>>>>>>> - uqdecd x2 >>>>>>>>>>> - ptrue p5.b, all >>>>>>>>>>> - whilelo p6.d, xzr, x2 >>>>>>>>>>> + mov z29.s, w5 >>>>>>>>>>> + mov z31.s, w2 >>>>>>>>>>> + index z30.d, #0, #1 >>>>>>>>>>> + ptrue p6.b, all >>>>>>>>>>> .p2align 3,,7 >>>>>>>>>>> .L86: >>>>>>>>>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3] >>>>>>>>>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3] >>>>>>>>>>> - movprfx z29, z30 >>>>>>>>>>> - mul z29.s, p5/m, z29.s, z31.s >>>>>>>>>>> - add z28.d, z28.d, #1 >>>>>>>>>>> - uunpklo z26.d, z29.s >>>>>>>>>>> - st1d z28.d, p7, [x0, z26.d, lsl 3] >>>>>>>>>>> - incw x4 >>>>>>>>>>> - uunpkhi z29.d, z29.s >>>>>>>>>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3] >>>>>>>>>>> + movprfx z28, z30 >>>>>>>>>>> + mul z28.s, p6/m, z28.s, z31.s >>>>>>>>>>> add z27.d, z27.d, #1 >>>>>>>>>>> - whilelo p6.d, x4, x2 >>>>>>>>>>> - st1d z27.d, p7, [x0, z29.d, lsl 3] >>>>>>>>>>> - incw z30.s >>>>>>>>>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3] >>>>>>>>>>> + incd x4 >>>>>>>>>>> + add z30.s, z30.s, z29.s >>>>>>>>>>> whilelo p7.d, x4, x3 >>>>>>>>>>> b.any .L86 >>>>>>>>>>> .L84: >>>>>>>>>>> ret >>>>>>>>>>> >>>>>>>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no >>>>>>>>>>> regression. 
We also ran SPEC2017 with -mcpu=generic on a Grace >>>>>>>>>>> machine and saw >>>>>>>>>>> no non-noise impact on performance. We would appreciate help with >>>>>>>>>>> wider >>>>>>>>>>> benchmarking on other platforms, if necessary. >>>>>>>>>>> OK for mainline? >>>>>>>>>>> >>>>>>>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> >>>>>>>>>>> >>>>>>>>>>> gcc/ >>>>>>>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove >>>>>>>>>>> use_new_vector_costs as tuning option. >>>>>>>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): >>>>>>>>>>> Remove. >>>>>>>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of >>>>>>>>>>> aarch64_use_new_vector_costs_p and guard call to >>>>>>>>>>> vect_is_store_elt_extraction with count > 1. >>>>>>>>>>> (aarch64_vector_costs::finish_cost): Remove use of >>>>>>>>>>> aarch64_use_new_vector_costs_p. >>>>>>>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove >>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. >>>>>>>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. >>>>>>>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. >>>>>>>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. >>>>>>>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. >>>>>>>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise. >>>>>>>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise. >>>>>>>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise. >>>>>>>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise. >>>>>>>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise. >>>>>>>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. >>>>>>>>>>> >>>>>>>>>>> gcc/testsuite/ >>>>>>>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. >>>>>>>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. 
>>>>>>>>>>> --- >>>>>>>>>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 -- >>>>>>>>>>> gcc/config/aarch64/aarch64.cc | 22 >>>>>>>>>>> +++++-------------- >>>>>>>>>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - >>>>>>>>>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - >>>>>>>>>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 - >>>>>>>>>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 - >>>>>>>>>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 - >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - >>>>>>>>>>> .../aarch64/tuning_models/neoversev3ae.h | 1 - >>>>>>>>>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- >>>>>>>>>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- >>>>>>>>>>> 15 files changed, 7 insertions(+), 32 deletions(-) >>>>>>>>>>> >>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>>>>>>> index 5939602576b..ed345b13ed3 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>>>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION >>>>>>>>>>> ("cheap_shift_extend", CHEAP_SHIFT_EXTEND) >>>>>>>>>>> >>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", >>>>>>>>>>> CSE_SVE_VL_CONSTANTS) >>>>>>>>>>> >>>>>>>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", >>>>>>>>>>> USE_NEW_VECTOR_COSTS) >>>>>>>>>>> - >>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", >>>>>>>>>>> MATCHED_VECTOR_THROUGHPUT) >>>>>>>>>>> >>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", >>>>>>>>>>> AVOID_CROSS_LOOP_FMA) >>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64.cc >>>>>>>>>>> b/gcc/config/aarch64/aarch64.cc >>>>>>>>>>> index 43238aefef2..03806671c97 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/aarch64.cc >>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64.cc >>>>>>>>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info >>>>>>>>>>> *vinfo, bool costing_for_scalar) >>>>>>>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar); >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> -/* Return true if the current CPU should use the new costs defined >>>>>>>>>>> - in GCC 11. This should be removed for GCC 12 and above, with >>>>>>>>>>> the >>>>>>>>>>> - costs applying to all CPUs instead. */ >>>>>>>>>>> -static bool >>>>>>>>>>> -aarch64_use_new_vector_costs_p () >>>>>>>>>>> -{ >>>>>>>>>>> - return (aarch64_tune_params.extra_tuning_flags >>>>>>>>>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); >>>>>>>>>>> -} >>>>>>>>>>> - >>>>>>>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. >>>>>>>>>>> */ >>>>>>>>>>> static const simd_vec_cost * >>>>>>>>>>> aarch64_simd_vec_costs (tree vectype) >>>>>>>>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int >>>>>>>>>>> count, vect_cost_for_stmt kind, >>>>>>>>>>> >>>>>>>>>>> /* Do one-time initialization based on the vinfo. 
*/ >>>>>>>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); >>>>>>>>>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) >>>>>>>>>>> + if (!m_analyzed_vinfo) >>>>>>>>>>> { >>>>>>>>>>> if (loop_vinfo) >>>>>>>>>>> analyze_loop_vinfo (loop_vinfo); >>>>>>>>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost (int >>>>>>>>>>> count, vect_cost_for_stmt kind, >>>>>>>>>>> >>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead >>>>>>>>>>> of just looking at KIND. */ >>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>>>>>>>>> + if (stmt_info) >>>>>>>>>>> { >>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one >>>>>>>>>>> vec_to_scalar for each element. However, we can store the first >>>>>>>>>>> element using an FP store without a separate extract step. */ >>>>>>>>>>> - if (vect_is_store_elt_extraction (kind, stmt_info)) >>>>>>>>>>> + if (vect_is_store_elt_extraction (kind, stmt_info) && count >>>>>>>>>>> > 1) >>>>>>>>>>> count -= 1; >>>>>>>>>>> >>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind, >>>>>>>>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int >>>>>>>>>>> count, vect_cost_for_stmt kind, >>>>>>>>>>> else >>>>>>>>>>> m_num_last_promote_demote = 0; >>>>>>>>>>> >>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>>>>>>>>> + if (stmt_info) >>>>>>>>>>> { >>>>>>>>>>> /* Account for any extra "embedded" costs that apply additively >>>>>>>>>>> to the base cost calculated above. */ >>>>>>>>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const >>>>>>>>>>> vector_costs *uncast_scalar_costs) >>>>>>>>>>> >>>>>>>>>>> auto *scalar_costs >>>>>>>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); >>>>>>>>>>> - if (loop_vinfo >>>>>>>>>>> - && m_vec_flags >>>>>>>>>>> - && aarch64_use_new_vector_costs_p ()) >>>>>>>>>>> + if (loop_vinfo && m_vec_flags) >>>>>>>>>>> { >>>>>>>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >>>>>>>>>>> m_costs[vect_body]); >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>>>>>>> index eb9b89984b0..dafea96e924 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params >>>>>>>>>>> cortexx925_tunings = >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>>>>>>> &generic_prefetch_tune, >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>>>>>>> index 6a098497759..ac001927959 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>>>>>>> @@ -55,7 +55,6 @@ static const struct tune_params >>>>>>>>>>> fujitsu_monaka_tunings = >>>>>>>>>>> 0, /* max_case_values. */ >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>>>>>>> &generic_prefetch_tune, >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>>>>>>> index 9b1cbfc5bd2..7b534831340 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>>>>>>> @@ -183,7 +183,6 @@ static const struct tune_params >>>>>>>>>>> generic_armv8_a_tunings = >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>>>>>>> &generic_prefetch_tune, >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>>>>>>> index 48353a59939..562ef89c67b 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>>>>>>> @@ -249,7 +249,6 @@ static const struct tune_params >>>>>>>>>>> generic_armv9_a_tunings = >>>>>>>>>>> 0, /* max_case_values. */ >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>>>>>>> &generic_armv9a_prefetch_tune, >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>>>>>>> index c407b89a22f..fe4f7c10f73 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>>>>>>> @@ -156,7 +156,6 @@ static const struct tune_params >>>>>>>>>>> neoverse512tvb_tunings = >>>>>>>>>>> 0, /* max_case_values. */ >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>>>>>>> &generic_prefetch_tune, >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>>>>>>> index 18199ac206c..56be77423cb 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params >>>>>>>>>>> neoversen2_tunings = >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>>>>>>> &generic_prefetch_tune, >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>>>>>>> index 4da85cfac0d..254ad5e27f8 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params >>>>>>>>>>> neoversen3_tunings = >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>>>>>>> &generic_prefetch_tune, >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>>>>>>> index dd9120eee48..c7241cf23d7 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>>>>>>> @@ -227,7 +227,6 @@ static const struct tune_params >>>>>>>>>>> neoversev1_tunings = >>>>>>>>>>> 0, /* max_case_values. */ >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>>>>>>> index 1369de73991..96f55940649 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>>>>>>> @@ -232,7 +232,6 @@ static const struct tune_params >>>>>>>>>>> neoversev2_tunings = >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW >>>>>>>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>>>>>>> index d8c82255378..f62ae67d355 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params >>>>>>>>>>> neoversev3_tunings = >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. 
*/ >>>>>>>>>>> &generic_prefetch_tune, >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>>>>>>> index 7f050501ede..0233baf5e34 100644 >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params >>>>>>>>>>> neoversev3ae_tunings = >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>>>>>>> &generic_prefetch_tune, >>>>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>>>>>>> index 762805ff54b..c334b7a6875 100644 >>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>>>>>>> @@ -15,4 +15,4 @@ >>>>>>>>>>> so we vectorize the offset calculation. This means that the >>>>>>>>>>> 64-bit version needs two copies. */ >>>>>>>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, >>>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, >>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, >>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>>>>>>> index f0ea58e38e2..94cc63049bc 100644 >>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>>>>>>> @@ -15,4 +15,4 @@ >>>>>>>>>>> so we vectorize the offset calculation. This means that the >>>>>>>>>>> 64-bit version needs two copies. */ >>>>>>>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], >>>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Richard Biener <rguent...@suse.de> >>>>>>>>>> SUSE Software Solutions Germany GmbH, >>>>>>>>>> Frankenstrasse 146, 90461 Nuernberg, Germany; >>>>>>>>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG >>>>>>>>>> Nuernberg)