> On 13 Dec 2024, at 18:00, Jennifer Schmitz <jschm...@nvidia.com> wrote:
>
>
>
>> On 13 Dec 2024, at 13:40, Richard Biener <richard.guent...@gmail.com> wrote:
>>
>>
>>> On Thu, Dec 12, 2024 at 5:27 PM Jennifer Schmitz <jschm...@nvidia.com>
>>> wrote:
>>>
>>>
>>>
>>>> On 6 Dec 2024, at 08:41, Jennifer Schmitz <jschm...@nvidia.com> wrote:
>>>>
>>>>
>>>>
>>>>> On 5 Dec 2024, at 20:07, Richard Sandiford <richard.sandif...@arm.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford
>>>>>>>>> <richard.sandif...@arm.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>>>>>> [...]
>>>>>>>>>> Looking at the diff of the vect dumps (below is a section of the
>>>>>>>>>> diff for strided_store_2.c), it seemed odd that vec_to_scalar
>>>>>>>>>> operations cost 0 now, instead of the previous cost of 2:
>>>>>>>>>>
>>>>>>>>>> +strided_store_1.c:38:151: note: === vectorizable_operation ===
>>>>>>>>>> +strided_store_1.c:38:151: note: vect_model_simple_cost:
>>>>>>>>>> inside_cost = 1, prologue_cost = 0 .
>>>>>>>>>> +strided_store_1.c:38:151: note: ==> examining statement: *_6 = _7;
>>>>>>>>>> +strided_store_1.c:38:151: note: vect_is_simple_use: operand _3 +
>>>>>>>>>> 1.0e+0, type of def: internal
>>>>>>>>>> +strided_store_1.c:38:151: note: Vectorizing an unaligned access.
>>>>>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128
>>>>>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
>>>>>>>>>> +strided_store_1.c:38:151: note: vect_model_store_cost:
>>>>>>>>>> inside_cost = 12, prologue_cost = 0 .
>>>>>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body
>>>>>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
>>>>>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>> +<unknown> 1 times vector_load costs 1 in prologue
>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>
>>>>>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in
>>>>>>>>>> multiple places in aarch64.cc, the location that causes this
>>>>>>>>>> behavior is this one:
>>>>>>>>>> unsigned
>>>>>>>>>> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt
>>>>>>>>>> kind,
>>>>>>>>>> stmt_vec_info stmt_info, slp_tree,
>>>>>>>>>> tree vectype, int misalign,
>>>>>>>>>> vect_cost_model_location where)
>>>>>>>>>> {
>>>>>>>>>> [...]
>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>>>>> of just looking at KIND. */
>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>>>> + if (stmt_info)
>>>>>>>>>> {
>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one
>>>>>>>>>> vec_to_scalar for each element. However, we can store the first
>>>>>>>>>> element using an FP store without a separate extract step. */
>>>>>>>>>> if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>>>>> count -= 1;
>>>>>>>>>>
>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>>>>>>> stmt_info,
>>>>>>>>>> stmt_cost);
>>>>>>>>>>
>>>>>>>>>> if (vectype && m_vec_flags)
>>>>>>>>>> stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>>>>>>>>>> stmt_info, vectype,
>>>>>>>>>> where, stmt_cost);
>>>>>>>>>> }
>>>>>>>>>> [...]
>>>>>>>>>> return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil
>>>>>>>>>> ());
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Previously, for mtune=generic, this function returned a cost of 2
>>>>>>>>>> for a vec_to_scalar operation in the vect body. Now "if (stmt_info)"
>>>>>>>>>> is entered and "if (vect_is_store_elt_extraction (kind, stmt_info))"
>>>>>>>>>> evaluates to true, which sets the count to 0 and leads to a return
>>>>>>>>>> value of 0.
>>>>>>>>>
>>>>>>>>> At the time the code was written, a scalarised store would be costed
>>>>>>>>> using one vec_to_scalar call into the backend, with the count parameter
>>>>>>>>> set to the number of elements being stored. The "count -= 1" was
>>>>>>>>> supposed to lop off the leading element extraction, since we can store
>>>>>>>>> lane 0 as a normal FP store.
>>>>>>>>>
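For reference, the older scheme amounted to roughly one call of this shape,
with the count argument covering all scalarised elements (a sketch only,
reusing the record_stmt_cost form from the hunk below rather than the exact
historical call):

  /* One backend call for the whole scalarised store, so the backend's
     "count -= 1" could drop the leading element extract.  */
  inside_cost += record_stmt_cost (cost_vec, nstores, vec_to_scalar,
                                   stmt_info, 0, vect_body);
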
>>>>>>>>> The target-independent costing was later reworked so that it costs
>>>>>>>>> each operation individually:
>>>>>>>>>
>>>>>>>>> for (i = 0; i < nstores; i++)
>>>>>>>>> {
>>>>>>>>> if (costing_p)
>>>>>>>>> {
>>>>>>>>> /* Only need vector extracting when there are more
>>>>>>>>> than one stores. */
>>>>>>>>> if (nstores > 1)
>>>>>>>>> inside_cost
>>>>>>>>> += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>>>>>>>>> stmt_info, 0, vect_body);
>>>>>>>>> /* Take a single lane vector type store as scalar
>>>>>>>>> store to avoid ICE like 110776. */
>>>>>>>>> if (VECTOR_TYPE_P (ltype)
>>>>>>>>> && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>>>>>>> n_adjacent_stores++;
>>>>>>>>> else
>>>>>>>>> inside_cost
>>>>>>>>> += record_stmt_cost (cost_vec, 1, scalar_store,
>>>>>>>>> stmt_info, 0, vect_body);
>>>>>>>>> continue;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Unfortunately, there's no easy way of telling whether a particular call
>>>>>>>>> is part of a group, and if so, which member of the group it is.
>>>>>>>>>
>>>>>>>>> I suppose we could give up on the attempt to be (somewhat) accurate
>>>>>>>>> and just disable the optimisation. Or we could restrict it to
>>>>>>>>> count > 1, since it might still be useful for gathers and scatters.
>>>>>>>> I tried restricting the vect_is_store_elt_extraction adjustment to
>>>>>>>> count > 1, and it seems to resolve the issue of vec_to_scalar
>>>>>>>> operations being costed at 0 (see patch below).
>>>>>>>> What are your thoughts on this?
>>>>>>>
>>>>>>> Why didn't you instead pursue moving the vec_to_scalar cost together
>>>>>>> with the n_adjacent_store handling?
>>>>>> When I continued working on this patch, we had already reached stage 3
>>>>>> and I was hesitant to introduce changes to the middle-end that were not
>>>>>> previously covered by this patch. So I tried to see whether the issue
>>>>>> could be resolved by making a small change in the backend.
>>>>>> If you still advise using the n_adjacent_store approach instead, I’m happy to
>>>>>> look into it again.
>>>>>
>>>>> If Richard's ok with adjusting vectorizable_store for GCC 15 (which it
>>>>> sounds like he is), then I agree that would be better. Otherwise we'd
>>>>> be creating technical debt to clean up for GCC 16. And it is a regression
>>>>> of sorts, so is stage 3 material from that POV.
>>>>>
>>>>> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a
>>>>> "let's clean this up next stage 1" thing, since we needed to add tuning
>>>>> for a new CPU late during the cycle. But of course, there were other
>>>>> priorities when stage 1 actually came around, so it never actually
>>>>> happened. Thanks again for being the one to sort this out.)
>>>> Thanks for your feedback. Then I will try to make it work in
>>>> vectorizable_store.
>>>> Best,
>>>> Jennifer
>>> Below is the updated patch with a suggestion for the changes in
>>> vectorizable_store. It resolves the issue of vec_to_scalar operations
>>> being individually costed at 0.
>>> We already tested it on aarch64 with no regressions, but we are still
>>> running performance testing.
>>> Could you give feedback on the patch itself in the meantime?
>>> Thanks,
>>> Jennifer
>>>
>>>
>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>> default. To that end, the function aarch64_use_new_vector_costs_p and its
>>> uses
>>> were removed. To prevent costing vec_to_scalar operations with 0, as
>>> described in
>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>> we adjusted vectorizable_store such that the variable n_adjacent_stores
>>> also covers vec_to_scalar operations. This way vec_to_scalar operations
>>> are not costed individually, but as a group.
>>>
>>> Two tests were adjusted due to changes in codegen. In both cases, the
>>> old code performed loop unrolling once, but the new code does not:
>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
>>> -moverride=tune=none):
>>> f_int64_t_32:
>>> cbz w3, .L92
>>> mov x4, 0
>>> uxtw x3, w3
>>> + cntd x5
>>> + whilelo p7.d, xzr, x3
>>> + mov z29.s, w5
>>> mov z31.s, w2
>>> - whilelo p6.d, xzr, x3
>>> - mov x2, x3
>>> - index z30.s, #0, #1
>>> - uqdecd x2
>>> - ptrue p5.b, all
>>> - whilelo p7.d, xzr, x2
>>> + index z30.d, #0, #1
>>> + ptrue p6.b, all
>>> .p2align 3,,7
>>> .L94:
>>> - ld1d z27.d, p7/z, [x0, #1, mul vl]
>>> - ld1d z28.d, p6/z, [x0]
>>> - movprfx z29, z31
>>> - mul z29.s, p5/m, z29.s, z30.s
>>> - incw x4
>>> - uunpklo z0.d, z29.s
>>> - uunpkhi z29.d, z29.s
>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3]
>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3]
>>> - add z25.d, z28.d, z25.d
>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3]
>>> + movprfx z28, z31
>>> + mul z28.s, p6/m, z28.s, z30.s
>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3]
>>> add z26.d, z27.d, z26.d
>>> - st1d z26.d, p7, [x0, #1, mul vl]
>>> - whilelo p7.d, x4, x2
>>> - st1d z25.d, p6, [x0]
>>> - incw z30.s
>>> - incb x0, all, mul #2
>>> - whilelo p6.d, x4, x3
>>> + st1d z26.d, p7, [x0, x4, lsl 3]
>>> + add z30.s, z30.s, z29.s
>>> + incd x4
>>> + whilelo p7.d, x4, x3
>>> b.any .L94
>>> .L92:
>>> ret
>>>
>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
>>> -moverride=tune=none):
>>> f_int64_t_32:
>>> cbz w3, .L84
>>> - addvl x5, x1, #1
>>> mov x4, 0
>>> uxtw x3, w3
>>> - mov z31.s, w2
>>> + cntd x5
>>> whilelo p7.d, xzr, x3
>>> - mov x2, x3
>>> - index z30.s, #0, #1
>>> - uqdecd x2
>>> - ptrue p5.b, all
>>> - whilelo p6.d, xzr, x2
>>> + mov z29.s, w5
>>> + mov z31.s, w2
>>> + index z30.d, #0, #1
>>> + ptrue p6.b, all
>>> .p2align 3,,7
>>> .L86:
>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3]
>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3]
>>> - movprfx z29, z30
>>> - mul z29.s, p5/m, z29.s, z31.s
>>> - add z28.d, z28.d, #1
>>> - uunpklo z26.d, z29.s
>>> - st1d z28.d, p7, [x0, z26.d, lsl 3]
>>> - incw x4
>>> - uunpkhi z29.d, z29.s
>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3]
>>> + movprfx z28, z30
>>> + mul z28.s, p6/m, z28.s, z31.s
>>> add z27.d, z27.d, #1
>>> - whilelo p6.d, x4, x2
>>> - st1d z27.d, p7, [x0, z29.d, lsl 3]
>>> - incw z30.s
>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3]
>>> + incd x4
>>> + add z30.s, z30.s, z29.s
>>> whilelo p7.d, x4, x3
>>> b.any .L86
>>> .L84:
>>> ret
>>>
>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
>>> regression.
>>> OK for mainline?
>>>
>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>
>>> gcc/
>>> * tree-vect-stmts.cc (vectorizable_store): Extend the use of
>>> n_adjacent_stores to also cover vec_to_scalar operations.
>>> * config/aarch64/aarch64-tuning-flags.def: Remove
>>> use_new_vector_costs as tuning option.
>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>> Remove.
>>> (aarch64_vector_costs::add_stmt_cost): Remove use of
>>> aarch64_use_new_vector_costs_p.
>>> (aarch64_vector_costs::finish_cost): Remove use of
>>> aarch64_use_new_vector_costs_p.
>>> * config/aarch64/tuning_models/cortexx925.h: Remove
>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>> * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>> * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>> * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>> * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>> * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>>
>>> gcc/testsuite/
>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>> ---
>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 --
>>> gcc/config/aarch64/aarch64.cc | 20 +++----------
>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 -
>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 -
>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 -
>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 -
>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 -
>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 -
>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 -
>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 -
>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 -
>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 -
>>> .../aarch64/tuning_models/neoversev3ae.h | 1 -
>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +-
>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +-
>>> gcc/tree-vect-stmts.cc | 29 ++++++++++---------
>>> 16 files changed, 22 insertions(+), 44 deletions(-)
>>>
>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>> index ffbff20e29c..1de633c739b 100644
>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend",
>>> CHEAP_SHIFT_EXTEND)
>>>
>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>>>
>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
>>> -
>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput",
>>> MATCHED_VECTOR_THROUGHPUT)
>>>
>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>> index 77a2a6bfa3a..71fba9cc63b 100644
>>> --- a/gcc/config/aarch64/aarch64.cc
>>> +++ b/gcc/config/aarch64/aarch64.cc
>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo,
>>> bool costing_for_scalar)
>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>> }
>>>
>>> -/* Return true if the current CPU should use the new costs defined
>>> - in GCC 11. This should be removed for GCC 12 and above, with the
>>> - costs applying to all CPUs instead. */
>>> -static bool
>>> -aarch64_use_new_vector_costs_p ()
>>> -{
>>> - return (aarch64_tune_params.extra_tuning_flags
>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>> -}
>>> -
>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */
>>> static const simd_vec_cost *
>>> aarch64_simd_vec_costs (tree vectype)
>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
>>> vect_cost_for_stmt kind,
>>>
>>> /* Do one-time initialization based on the vinfo. */
>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>> + if (!m_analyzed_vinfo)
>>> {
>>> if (loop_vinfo)
>>> analyze_loop_vinfo (loop_vinfo);
>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
>>> vect_cost_for_stmt kind,
>>>
>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>> of just looking at KIND. */
>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
>>> + if (stmt_info)
>>> {
>>> /* If we scalarize a strided store, the vectorizer costs one
>>> vec_to_scalar for each element. However, we can store the first
>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
>>> vect_cost_for_stmt kind,
>>> else
>>> m_num_last_promote_demote = 0;
>>>
>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
>>> + if (stmt_info)
>>> {
>>> /* Account for any extra "embedded" costs that apply additively
>>> to the base cost calculated above. */
>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const
>>> vector_costs *uncast_scalar_costs)
>>>
>>> auto *scalar_costs
>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>> - if (loop_vinfo
>>> - && m_vec_flags
>>> - && aarch64_use_new_vector_costs_p ())
>>> + if (loop_vinfo && m_vec_flags)
>>> {
>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>> m_costs[vect_body]);
>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h
>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>>> index b2ff716157a..0a8eff69307 100644
>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>> (AARCH64_EXTRA_TUNE_BASE
>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>> &generic_prefetch_tune,
>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>> index 2d704ecd110..a564528f43d 100644
>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
>>> 0, /* max_case_values. */
>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>> (AARCH64_EXTRA_TUNE_BASE
>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>> &generic_prefetch_tune,
>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>> index bdd309ab03d..f090d5cde50 100644
>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>> @@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings
>>> =
>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>> (AARCH64_EXTRA_TUNE_BASE
>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>> &generic_prefetch_tune,
>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>> index a05a9ab92a2..4c33c147444 100644
>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>> @@ -249,7 +249,6 @@ static const struct tune_params generic_armv9_a_tunings
>>> =
>>> 0, /* max_case_values. */
>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>> (AARCH64_EXTRA_TUNE_BASE
>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>> &generic_armv9a_prefetch_tune,
>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>> index c407b89a22f..fe4f7c10f73 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings =
>>> 0, /* max_case_values. */
>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>> &generic_prefetch_tune,
>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h
>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>>> index fd5f8f37370..0c74068da2c 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>> (AARCH64_EXTRA_TUNE_BASE
>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>> &generic_prefetch_tune,
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h
>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>>> index 8b156c2fe4d..9d4e1be171a 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>> (AARCH64_EXTRA_TUNE_BASE
>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>> &generic_prefetch_tune,
>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h
>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>>> index 23c121d8652..85a78bb2bef 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>> (AARCH64_EXTRA_TUNE_BASE
>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>> &generic_prefetch_tune,
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h
>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>> index 40af5f47f4f..1dd452beb8d 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>> (AARCH64_EXTRA_TUNE_BASE
>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h
>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>>> index d65d74bfecf..d0ba5b1aef6 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>> (AARCH64_EXTRA_TUNE_BASE
>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>> &generic_prefetch_tune,
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>> index 7b7fa0b4b08..a1572048503 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>> (AARCH64_EXTRA_TUNE_BASE
>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>> &generic_prefetch_tune,
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>> index 762805ff54b..c334b7a6875 100644
>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>> @@ -15,4 +15,4 @@
>>> so we vectorize the offset calculation. This means that the
>>> 64-bit version needs two copies. */
>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z,
>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z,
>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z,
>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>> index f0ea58e38e2..94cc63049bc 100644
>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>> @@ -15,4 +15,4 @@
>>> so we vectorize the offset calculation. This means that the
>>> 64-bit version needs two copies. */
>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7],
>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7],
>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7],
>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>>> index be1139a423c..6d7d28c4702 100644
>>> --- a/gcc/tree-vect-stmts.cc
>>> +++ b/gcc/tree-vect-stmts.cc
>>> @@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo,
>>> {
>>> if (costing_p)
>>> {
>>> - /* Only need vector extracting when there are more
>>> - than one stores. */
>>> - if (nstores > 1)
>>> - inside_cost
>>> - += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>>> - stmt_info, slp_node,
>>> - 0, vect_body);
>>> /* Take a single lane vector type store as scalar
>>> store to avoid ICE like 110776. */
>>> - if (VECTOR_TYPE_P (ltype)
>>> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>> + bool single_lane_vec_p =
>>> + VECTOR_TYPE_P (ltype)
>>> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U);
>>> + /* Only need vector extracting when there are more
>>> + than one stores. */
>>> + if (nstores > 1 || single_lane_vec_p)
>>> n_adjacent_stores++;
>>> - else
>>> + if (!single_lane_vec_p)
>>
>> I think it's somewhat non-obvious that nstores > 1 and single_lane_vec_p
>> correlate. In fact I think that we always record a store; just for
>> single-element vectors we record scalar stores. I suggest here to always
>> just do n_adjacent_stores++ and below ...
>>
>>> inside_cost
>>> += record_stmt_cost (cost_vec, 1, scalar_store,
>>> stmt_info, 0, vect_body);
>>> @@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo,
>>> if (costing_p)
>>> {
>>> if (n_adjacent_stores > 0)
>>> - vect_get_store_cost (vinfo, stmt_info, slp_node,
>>> n_adjacent_stores,
>>> - alignment_support_scheme, misalignment,
>>> - &inside_cost, cost_vec);
>>> + {
>>> + vect_get_store_cost (vinfo, stmt_info, slp_node,
>>> n_adjacent_stores,
>>> + alignment_support_scheme, misalignment,
>>> + &inside_cost, cost_vec);
>>
>> ... record n_adjacent_stores scalar_store when ltype is single-lane and
>> record n_adjacent_stores vec_to_scalar if nstores > 1 (and else none).
>>
>> Richard.
> Thanks for the feedback, I’m glad it’s going in the right direction. Below is
> the updated patch, re-validated on aarch64.
> Thanks, Jennifer
>
> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
> default. To that end, the function aarch64_use_new_vector_costs_p and its uses
> were removed. To prevent costing vec_to_scalar operations with 0, as
> described in
> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
> we adjusted vectorizable_store such that the variable n_adjacent_stores
> also covers vec_to_scalar operations. This way vec_to_scalar operations
> are not costed individually, but as a group.
>
> Two tests were adjusted due to changes in codegen. In both cases, the
> old code performed loop unrolling once, but the new code does not:
> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
> -moverride=tune=none):
> f_int64_t_32:
> cbz w3, .L92
> mov x4, 0
> uxtw x3, w3
> + cntd x5
> + whilelo p7.d, xzr, x3
> + mov z29.s, w5
> mov z31.s, w2
> - whilelo p6.d, xzr, x3
> - mov x2, x3
> - index z30.s, #0, #1
> - uqdecd x2
> - ptrue p5.b, all
> - whilelo p7.d, xzr, x2
> + index z30.d, #0, #1
> + ptrue p6.b, all
> .p2align 3,,7
> .L94:
> - ld1d z27.d, p7/z, [x0, #1, mul vl]
> - ld1d z28.d, p6/z, [x0]
> - movprfx z29, z31
> - mul z29.s, p5/m, z29.s, z30.s
> - incw x4
> - uunpklo z0.d, z29.s
> - uunpkhi z29.d, z29.s
> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3]
> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3]
> - add z25.d, z28.d, z25.d
> + ld1d z27.d, p7/z, [x0, x4, lsl 3]
> + movprfx z28, z31
> + mul z28.s, p6/m, z28.s, z30.s
> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3]
> add z26.d, z27.d, z26.d
> - st1d z26.d, p7, [x0, #1, mul vl]
> - whilelo p7.d, x4, x2
> - st1d z25.d, p6, [x0]
> - incw z30.s
> - incb x0, all, mul #2
> - whilelo p6.d, x4, x3
> + st1d z26.d, p7, [x0, x4, lsl 3]
> + add z30.s, z30.s, z29.s
> + incd x4
> + whilelo p7.d, x4, x3
> b.any .L94
> .L92:
> ret
>
> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
> -moverride=tune=none):
> f_int64_t_32:
> cbz w3, .L84
> - addvl x5, x1, #1
> mov x4, 0
> uxtw x3, w3
> - mov z31.s, w2
> + cntd x5
> whilelo p7.d, xzr, x3
> - mov x2, x3
> - index z30.s, #0, #1
> - uqdecd x2
> - ptrue p5.b, all
> - whilelo p6.d, xzr, x2
> + mov z29.s, w5
> + mov z31.s, w2
> + index z30.d, #0, #1
> + ptrue p6.b, all
> .p2align 3,,7
> .L86:
> - ld1d z28.d, p7/z, [x1, x4, lsl 3]
> - ld1d z27.d, p6/z, [x5, x4, lsl 3]
> - movprfx z29, z30
> - mul z29.s, p5/m, z29.s, z31.s
> - add z28.d, z28.d, #1
> - uunpklo z26.d, z29.s
> - st1d z28.d, p7, [x0, z26.d, lsl 3]
> - incw x4
> - uunpkhi z29.d, z29.s
> + ld1d z27.d, p7/z, [x1, x4, lsl 3]
> + movprfx z28, z30
> + mul z28.s, p6/m, z28.s, z31.s
> add z27.d, z27.d, #1
> - whilelo p6.d, x4, x2
> - st1d z27.d, p7, [x0, z29.d, lsl 3]
> - incw z30.s
> + st1d z27.d, p7, [x0, z28.d, uxtw 3]
> + incd x4
> + add z30.s, z30.s, z29.s
> whilelo p7.d, x4, x3
> b.any .L86
> .L84:
> ret
>
> The patch was bootstrapped and tested on aarch64-linux-gnu, no
> regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>
> gcc/
> * tree-vect-stmts.cc (vectorizable_store): Extend the use of
> n_adjacent_stores to also cover vec_to_scalar operations.
> * config/aarch64/aarch64-tuning-flags.def: Remove
> use_new_vector_costs as tuning option.
> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
> Remove.
> (aarch64_vector_costs::add_stmt_cost): Remove use of
> aarch64_use_new_vector_costs_p.
> (aarch64_vector_costs::finish_cost): Remove use of
> aarch64_use_new_vector_costs_p.
> * config/aarch64/tuning_models/cortexx925.h: Remove
> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
> * config/aarch64/tuning_models/neoversen2.h: Likewise.
> * config/aarch64/tuning_models/neoversen3.h: Likewise.
> * config/aarch64/tuning_models/neoversev1.h: Likewise.
> * config/aarch64/tuning_models/neoversev2.h: Likewise.
> * config/aarch64/tuning_models/neoversev3.h: Likewise.
> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>
> gcc/testsuite/
> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
> ---
> gcc/config/aarch64/aarch64-tuning-flags.def | 2 -
> gcc/config/aarch64/aarch64.cc | 20 ++--------
> gcc/config/aarch64/tuning_models/cortexx925.h | 1 -
> .../aarch64/tuning_models/fujitsu_monaka.h | 1 -
> .../aarch64/tuning_models/generic_armv8_a.h | 1 -
> .../aarch64/tuning_models/generic_armv9_a.h | 1 -
> .../aarch64/tuning_models/neoverse512tvb.h | 1 -
> gcc/config/aarch64/tuning_models/neoversen2.h | 1 -
> gcc/config/aarch64/tuning_models/neoversen3.h | 1 -
> gcc/config/aarch64/tuning_models/neoversev1.h | 1 -
> gcc/config/aarch64/tuning_models/neoversev2.h | 1 -
> gcc/config/aarch64/tuning_models/neoversev3.h | 1 -
> .../aarch64/tuning_models/neoversev3ae.h | 1 -
> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +-
> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +-
> gcc/tree-vect-stmts.cc | 37 +++++++++++--------
> 16 files changed, 27 insertions(+), 47 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> index ffbff20e29c..1de633c739b 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend",
> CHEAP_SHIFT_EXTEND)
>
> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>
> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
> -
> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput",
> MATCHED_VECTOR_THROUGHPUT)
>
> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 77a2a6bfa3a..71fba9cc63b 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo,
> bool costing_for_scalar)
> return new aarch64_vector_costs (vinfo, costing_for_scalar);
> }
>
> -/* Return true if the current CPU should use the new costs defined
> - in GCC 11. This should be removed for GCC 12 and above, with the
> - costs applying to all CPUs instead. */
> -static bool
> -aarch64_use_new_vector_costs_p ()
> -{
> - return (aarch64_tune_params.extra_tuning_flags
> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
> -}
> -
> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */
> static const simd_vec_cost *
> aarch64_simd_vec_costs (tree vectype)
> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
> vect_cost_for_stmt kind,
>
> /* Do one-time initialization based on the vinfo. */
> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
> + if (!m_analyzed_vinfo)
> {
> if (loop_vinfo)
> analyze_loop_vinfo (loop_vinfo);
> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
> vect_cost_for_stmt kind,
>
> /* Try to get a more accurate cost by looking at STMT_INFO instead
> of just looking at KIND. */
> - if (stmt_info && aarch64_use_new_vector_costs_p ())
> + if (stmt_info)
> {
> /* If we scalarize a strided store, the vectorizer costs one
> vec_to_scalar for each element. However, we can store the first
> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
> vect_cost_for_stmt kind,
> else
> m_num_last_promote_demote = 0;
>
> - if (stmt_info && aarch64_use_new_vector_costs_p ())
> + if (stmt_info)
> {
> /* Account for any extra "embedded" costs that apply additively
> to the base cost calculated above. */
> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const vector_costs
> *uncast_scalar_costs)
>
> auto *scalar_costs
> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
> - if (loop_vinfo
> - && m_vec_flags
> - && aarch64_use_new_vector_costs_p ())
> + if (loop_vinfo && m_vec_flags)
> {
> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> m_costs[vect_body]);
> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h
> b/gcc/config/aarch64/tuning_models/cortexx925.h
> index 5ebaf66e986..74772f3e15f 100644
> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> index 2d704ecd110..a564528f43d 100644
> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
> 0, /* max_case_values. */
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> &generic_prefetch_tune,
> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> index bdd309ab03d..f090d5cde50 100644
> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> @@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> &generic_prefetch_tune,
> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> index 785e00946bc..7b5821183bc 100644
> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> @@ -251,7 +251,6 @@ static const struct tune_params generic_armv9_a_tunings =
> 0, /* max_case_values. */
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> index 007f987154c..f7457df59e5 100644
> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings =
> 0, /* max_case_values. */
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h
> b/gcc/config/aarch64/tuning_models/neoversen2.h
> index 32560d2f5f8..541b61c8179 100644
> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h
> b/gcc/config/aarch64/tuning_models/neoversen3.h
> index 2010bc4645b..eff668132a8 100644
> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h
> b/gcc/config/aarch64/tuning_models/neoversev1.h
> index c3751e32696..d11472b6e1e 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h
> b/gcc/config/aarch64/tuning_models/neoversev2.h
> index 80dbe5c806c..ee77ffdd3bc 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h
> b/gcc/config/aarch64/tuning_models/neoversev3.h
> index efe09e16d1e..6ef143ef7d5 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> index 66849f30889..96bdbf971f1 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> index 762805ff54b..c334b7a6875 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> @@ -15,4 +15,4 @@
> so we vectorize the offset calculation. This means that the
> 64-bit version needs two copies. */
> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z,
> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z,
> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z,
> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> index f0ea58e38e2..94cc63049bc 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> @@ -15,4 +15,4 @@
> so we vectorize the offset calculation. This means that the
> 64-bit version needs two copies. */
> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+,
> z[0-9]+.s, uxtw 2\]\n} 3 } } */
> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+,
> z[0-9]+.d, lsl 3\]\n} 15 } } */
> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+,
> z[0-9]+.d, lsl 3\]\n} 9 } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index be1139a423c..ab57163c243 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -8834,19 +8834,8 @@ vectorizable_store (vec_info *vinfo,
> {
> if (costing_p)
> {
> - /* Only need vector extracting when there are more
> - than one stores. */
> - if (nstores > 1)
> - inside_cost
> - += record_stmt_cost (cost_vec, 1, vec_to_scalar,
> - stmt_info, slp_node,
> - 0, vect_body);
> - /* Take a single lane vector type store as scalar
> - store to avoid ICE like 110776. */
> - if (VECTOR_TYPE_P (ltype)
> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> - n_adjacent_stores++;
> - else
> + n_adjacent_stores++;
> + if (!VECTOR_TYPE_P (ltype))
This should be combined with the single-lane vector case below.
> inside_cost
> += record_stmt_cost (cost_vec, 1, scalar_store,
> stmt_info, 0, vect_body);
> @@ -8905,9 +8894,25 @@ vectorizable_store (vec_info *vinfo,
> if (costing_p)
> {
> if (n_adjacent_stores > 0)
> - vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
> - alignment_support_scheme, misalignment,
> - &inside_cost, cost_vec);
> + {
> + /* Take a single lane vector type store as scalar
> + store to avoid ICE like 110776. */
> + if (VECTOR_TYPE_P (ltype)
> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> + inside_cost
> + += record_stmt_cost (cost_vec, n_adjacent_stores,
> + scalar_store, stmt_info, 0, vect_body);
> + /* Only need vector extracting when there are more
> + than one stores. */
> + if (nstores > 1)
> + inside_cost
> + += record_stmt_cost (cost_vec, n_adjacent_stores,
> + vec_to_scalar, stmt_info, slp_node,
> + 0, vect_body);
> + vect_get_store_cost (vinfo, stmt_info, slp_node,
This should only be done for multi-lane vectors.
> + n_adjacent_stores, alignment_support_scheme,
> + misalignment, &inside_cost, cost_vec);
> + }
> if (dump_enabled_p ())
> dump_printf_loc (MSG_NOTE, vect_location,
> "vect_model_store_cost: inside_cost = %d, "
> --
> 2.34.1
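To make the two inline comments above concrete, one possible shape for the
n_adjacent_stores block is sketched below (with the per-iteration scalar_store
recording in the loop dropped). It reuses the helpers and arguments from the
hunk above; treat it as a sketch, not the required form:

  if (n_adjacent_stores > 0)
    {
      if (VECTOR_TYPE_P (ltype)
          && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
        /* Multi-lane vectors: cost the vector stores themselves.  */
        vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
                             alignment_support_scheme, misalignment,
                             &inside_cost, cost_vec);
      else
        /* Single-lane vector or non-vector ltype: plain scalar stores.  */
        inside_cost
          += record_stmt_cost (cost_vec, n_adjacent_stores, scalar_store,
                               stmt_info, 0, vect_body);
      /* Lane extraction is only needed when there is more than one store.  */
      if (nstores > 1)
        inside_cost
          += record_stmt_cost (cost_vec, n_adjacent_stores, vec_to_scalar,
                               stmt_info, slp_node, 0, vect_body);
    }
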
>>
>>> + inside_cost
>>> + += record_stmt_cost (cost_vec, n_adjacent_stores,
>>> vec_to_scalar,
>>> + stmt_info, slp_node,
>>> + 0, vect_body);
>>> + }
>>> if (dump_enabled_p ())
>>> dump_printf_loc (MSG_NOTE, vect_location,
>>> "vect_model_store_cost: inside_cost = %d, "
>>> --
>>> 2.44.0
>>>
>>>
>>>>>
>>>>> Richard
>>>>>
>>>>>> Thanks,
>>>>>> Jennifer
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jennifer
>>>>>>>>
>>>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable
>>>>>>>> and
>>>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>>>>>> default. To that end, the function aarch64_use_new_vector_costs_p and
>>>>>>>> its uses
>>>>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as
>>>>>>>> described in
>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>>>>>>> we guarded the call to vect_is_store_elt_extraction in
>>>>>>>> aarch64_vector_costs::add_stmt_cost by count > 1.
>>>>>>>>
>>>>>>>> Two tests were adjusted due to changes in codegen. In both cases, the
>>>>>>>> old code performed loop unrolling once, but the new code does not:
>>>>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
>>>>>>>> -moverride=tune=none):
>>>>>>>> f_int64_t_32:
>>>>>>>> cbz w3, .L92
>>>>>>>> mov x4, 0
>>>>>>>> uxtw x3, w3
>>>>>>>> + cntd x5
>>>>>>>> + whilelo p7.d, xzr, x3
>>>>>>>> + mov z29.s, w5
>>>>>>>> mov z31.s, w2
>>>>>>>> - whilelo p6.d, xzr, x3
>>>>>>>> - mov x2, x3
>>>>>>>> - index z30.s, #0, #1
>>>>>>>> - uqdecd x2
>>>>>>>> - ptrue p5.b, all
>>>>>>>> - whilelo p7.d, xzr, x2
>>>>>>>> + index z30.d, #0, #1
>>>>>>>> + ptrue p6.b, all
>>>>>>>> .p2align 3,,7
>>>>>>>> .L94:
>>>>>>>> - ld1d z27.d, p7/z, [x0, #1, mul vl]
>>>>>>>> - ld1d z28.d, p6/z, [x0]
>>>>>>>> - movprfx z29, z31
>>>>>>>> - mul z29.s, p5/m, z29.s, z30.s
>>>>>>>> - incw x4
>>>>>>>> - uunpklo z0.d, z29.s
>>>>>>>> - uunpkhi z29.d, z29.s
>>>>>>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3]
>>>>>>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3]
>>>>>>>> - add z25.d, z28.d, z25.d
>>>>>>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3]
>>>>>>>> + movprfx z28, z31
>>>>>>>> + mul z28.s, p6/m, z28.s, z30.s
>>>>>>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3]
>>>>>>>> add z26.d, z27.d, z26.d
>>>>>>>> - st1d z26.d, p7, [x0, #1, mul vl]
>>>>>>>> - whilelo p7.d, x4, x2
>>>>>>>> - st1d z25.d, p6, [x0]
>>>>>>>> - incw z30.s
>>>>>>>> - incb x0, all, mul #2
>>>>>>>> - whilelo p6.d, x4, x3
>>>>>>>> + st1d z26.d, p7, [x0, x4, lsl 3]
>>>>>>>> + add z30.s, z30.s, z29.s
>>>>>>>> + incd x4
>>>>>>>> + whilelo p7.d, x4, x3
>>>>>>>> b.any .L94
>>>>>>>> .L92:
>>>>>>>> ret
>>>>>>>>
>>>>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
>>>>>>>> -moverride=tune=none):
>>>>>>>> f_int64_t_32:
>>>>>>>> cbz w3, .L84
>>>>>>>> - addvl x5, x1, #1
>>>>>>>> mov x4, 0
>>>>>>>> uxtw x3, w3
>>>>>>>> - mov z31.s, w2
>>>>>>>> + cntd x5
>>>>>>>> whilelo p7.d, xzr, x3
>>>>>>>> - mov x2, x3
>>>>>>>> - index z30.s, #0, #1
>>>>>>>> - uqdecd x2
>>>>>>>> - ptrue p5.b, all
>>>>>>>> - whilelo p6.d, xzr, x2
>>>>>>>> + mov z29.s, w5
>>>>>>>> + mov z31.s, w2
>>>>>>>> + index z30.d, #0, #1
>>>>>>>> + ptrue p6.b, all
>>>>>>>> .p2align 3,,7
>>>>>>>> .L86:
>>>>>>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3]
>>>>>>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3]
>>>>>>>> - movprfx z29, z30
>>>>>>>> - mul z29.s, p5/m, z29.s, z31.s
>>>>>>>> - add z28.d, z28.d, #1
>>>>>>>> - uunpklo z26.d, z29.s
>>>>>>>> - st1d z28.d, p7, [x0, z26.d, lsl 3]
>>>>>>>> - incw x4
>>>>>>>> - uunpkhi z29.d, z29.s
>>>>>>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3]
>>>>>>>> + movprfx z28, z30
>>>>>>>> + mul z28.s, p6/m, z28.s, z31.s
>>>>>>>> add z27.d, z27.d, #1
>>>>>>>> - whilelo p6.d, x4, x2
>>>>>>>> - st1d z27.d, p7, [x0, z29.d, lsl 3]
>>>>>>>> - incw z30.s
>>>>>>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3]
>>>>>>>> + incd x4
>>>>>>>> + add z30.s, z30.s, z29.s
>>>>>>>> whilelo p7.d, x4, x3
>>>>>>>> b.any .L86
>>>>>>>> .L84:
>>>>>>>> ret
>>>>>>>>
>>>>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
>>>>>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace machine
>>>>>>>> and saw
>>>>>>>> no non-noise impact on performance. We would appreciate help with wider
>>>>>>>> benchmarking on other platforms, if necessary.
>>>>>>>> OK for mainline?
>>>>>>>>
>>>>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>>>>>>
>>>>>>>> gcc/
>>>>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove
>>>>>>>> use_new_vector_costs as tuning option.
>>>>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>>>>>>> Remove.
>>>>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of
>>>>>>>> aarch64_use_new_vector_costs_p and guard call to
>>>>>>>> vect_is_store_elt_extraction with count > 1.
>>>>>>>> (aarch64_vector_costs::finish_cost): Remove use of
>>>>>>>> aarch64_use_new_vector_costs_p.
>>>>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove
>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>>>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>>>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>>>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>>>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>>>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>>>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>>>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>>>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>>>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>>>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>>>>>>>
>>>>>>>> gcc/testsuite/
>>>>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>>>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>>>>>>> ---
>>>>>>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 --
>>>>>>>> gcc/config/aarch64/aarch64.cc | 22 +++++--------------
>>>>>>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 -
>>>>>>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 -
>>>>>>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 -
>>>>>>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 -
>>>>>>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 -
>>>>>>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 -
>>>>>>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 -
>>>>>>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 -
>>>>>>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 -
>>>>>>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 -
>>>>>>>> .../aarch64/tuning_models/neoversev3ae.h | 1 -
>>>>>>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +-
>>>>>>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +-
>>>>>>>> 15 files changed, 7 insertions(+), 32 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>> index 5939602576b..ed345b13ed3 100644
>>>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend",
>>>>>>>> CHEAP_SHIFT_EXTEND)
>>>>>>>>
>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants",
>>>>>>>> CSE_SVE_VL_CONSTANTS)
>>>>>>>>
>>>>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs",
>>>>>>>> USE_NEW_VECTOR_COSTS)
>>>>>>>> -
>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput",
>>>>>>>> MATCHED_VECTOR_THROUGHPUT)
>>>>>>>>
>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma",
>>>>>>>> AVOID_CROSS_LOOP_FMA)
>>>>>>>> diff --git a/gcc/config/aarch64/aarch64.cc
>>>>>>>> b/gcc/config/aarch64/aarch64.cc
>>>>>>>> index 43238aefef2..03806671c97 100644
>>>>>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>>>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>>>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info
>>>>>>>> *vinfo, bool costing_for_scalar)
>>>>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>>>>>>> }
>>>>>>>>
>>>>>>>> -/* Return true if the current CPU should use the new costs defined
>>>>>>>> - in GCC 11. This should be removed for GCC 12 and above, with the
>>>>>>>> - costs applying to all CPUs instead. */
>>>>>>>> -static bool
>>>>>>>> -aarch64_use_new_vector_costs_p ()
>>>>>>>> -{
>>>>>>>> - return (aarch64_tune_params.extra_tuning_flags
>>>>>>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>>>>>>> -}
>>>>>>>> -
>>>>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */
>>>>>>>> static const simd_vec_cost *
>>>>>>>> aarch64_simd_vec_costs (tree vectype)
>>>>>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int
>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>
>>>>>>>> /* Do one-time initialization based on the vinfo. */
>>>>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>>>>>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>>>>>>> + if (!m_analyzed_vinfo)
>>>>>>>> {
>>>>>>>> if (loop_vinfo)
>>>>>>>> analyze_loop_vinfo (loop_vinfo);
>>>>>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>>>>>>>
>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>>> of just looking at KIND. */
>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>> + if (stmt_info)
>>>>>>>> {
>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one
>>>>>>>> vec_to_scalar for each element. However, we can store the first
>>>>>>>> element using an FP store without a separate extract step. */
>>>>>>>> - if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>>> + if (vect_is_store_elt_extraction (kind, stmt_info) && count > 1)
>>>>>>>> count -= 1;
>>>>>>>>
>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>>>>>>> else
>>>>>>>> m_num_last_promote_demote = 0;
>>>>>>>>
>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>> + if (stmt_info)
>>>>>>>> {
>>>>>>>> /* Account for any extra "embedded" costs that apply additively
>>>>>>>> to the base cost calculated above. */
>>>>>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>>>>>>>>
>>>>>>>> auto *scalar_costs
>>>>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>>>>>>> - if (loop_vinfo
>>>>>>>> - && m_vec_flags
>>>>>>>> - && aarch64_use_new_vector_costs_p ())
>>>>>>>> + if (loop_vinfo && m_vec_flags)
>>>>>>>> {
>>>>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>>>>>> m_costs[vect_body]);
>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>> index eb9b89984b0..dafea96e924 100644
>>>>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings =
>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>>>>>>> &generic_prefetch_tune,
>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>> index 6a098497759..ac001927959 100644
>>>>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
>>>>>>>> 0, /* max_case_values. */
>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>>>>>>> &generic_prefetch_tune,
>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>> index 9b1cbfc5bd2..7b534831340 100644
>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>> @@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings =
>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>>>>>>> &generic_prefetch_tune,
>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>> index 48353a59939..562ef89c67b 100644
>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>> @@ -249,7 +249,6 @@ static const struct tune_params generic_armv9_a_tunings =
>>>>>>>> 0, /* max_case_values. */
>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>>>>>>> &generic_armv9a_prefetch_tune,
>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>> index c407b89a22f..fe4f7c10f73 100644
>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings =
>>>>>>>> 0, /* max_case_values. */
>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>>>>>>> &generic_prefetch_tune,
>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>> index 18199ac206c..56be77423cb 100644
>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>>>>>>> &generic_prefetch_tune,
>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>> index 4da85cfac0d..254ad5e27f8 100644
>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>>>>>>> &generic_prefetch_tune,
>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>> index dd9120eee48..c7241cf23d7 100644
>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>> @@ -227,7 +227,6 @@ static const struct tune_params neoversev1_tunings =
>>>>>>>> 0, /* max_case_values. */
>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>> | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>> index 1369de73991..96f55940649 100644
>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings =
>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>>>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */
>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>> index d8c82255378..f62ae67d355 100644
>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>>>>>>> &generic_prefetch_tune,
>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>> index 7f050501ede..0233baf5e34 100644
>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>>>>>>> &generic_prefetch_tune,
>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>> index 762805ff54b..c334b7a6875 100644
>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>> @@ -15,4 +15,4 @@
>>>>>>>> so we vectorize the offset calculation. This means that the
>>>>>>>> 64-bit version needs two copies. */
>>>>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>> index f0ea58e38e2..94cc63049bc 100644
>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>> @@ -15,4 +15,4 @@
>>>>>>>> so we vectorize the offset calculation. This means that the
>>>>>>>> 64-bit version needs two copies. */
>>>>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Richard Biener <rguent...@suse.de>
>>>>>>> SUSE Software Solutions Germany GmbH,
>>>>>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>>>>>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>
>
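To make the one behavioural change in the patch easier to see: the store-element extraction discount now only applies when more than one element is extracted, so a single vec_to_scalar is no longer discounted down to a count of 0 and costed as free. Below is a minimal, stand-alone sketch of that guard; it is not the vectorizer code itself, the helper name and the plain-int modelling of count are made up for illustration, and the real code additionally checks vect_is_store_elt_extraction (kind, stmt_info):

#include <iostream>

/* Model of the discount applied when a strided store is scalarized:
   the first element can be stored with an FP store, so one extraction
   is not charged -- but only if there is more than one extraction,
   otherwise the single extraction would be discounted to a count of 0
   and the statement costed as free.  */
static int
discounted_extraction_count (int count)
{
  /* Mirrors the guarded form in aarch64_vector_costs::add_stmt_cost:
     if (vect_is_store_elt_extraction (kind, stmt_info) && count > 1)
       count -= 1;  */
  if (count > 1)
    count -= 1;
  return count;
}

int
main ()
{
  std::cout << discounted_extraction_count (1) << "\n";  /* prints 1, not 0 */
  std::cout << discounted_extraction_count (4) << "\n";  /* prints 3 */
  return 0;
}

With use_new_vector_costs gone, this code path is taken whenever stmt_info is available; the adjusted ld1d/st1d counts in strided_load_2.c and strided_store_2.c presumably reflect the resulting change in costing.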