> On 17 Dec 2024, at 18:57, Richard Biener <rguent...@suse.de> wrote:
> 
> 
> 
>> Am 16.12.2024 um 09:10 schrieb Jennifer Schmitz <jschm...@nvidia.com>:
>> 
>> 
>> 
>>> On 14 Dec 2024, at 09:32, Richard Biener <rguent...@suse.de> wrote:
>>> 
>>> 
>>> 
>>>>> Am 13.12.2024 um 18:00 schrieb Jennifer Schmitz <jschm...@nvidia.com>:
>>>> 
>>>> 
>>>> 
>>>>> On 13 Dec 2024, at 13:40, Richard Biener <richard.guent...@gmail.com> 
>>>>> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Thu, Dec 12, 2024 at 5:27 PM Jennifer Schmitz <jschm...@nvidia.com> 
>>>>>> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 6 Dec 2024, at 08:41, Jennifer Schmitz <jschm...@nvidia.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 5 Dec 2024, at 20:07, Richard Sandiford <richard.sandif...@arm.com> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>>>>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford 
>>>>>>>>>>>> <richard.sandif...@arm.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>>>>>>>>> [...]
>>>>>>>>>>>>> Looking at the diff of the vect dumps (below is a section of the 
>>>>>>>>>>>>> diff for strided_store_2.c), it seemed odd that vec_to_scalar 
>>>>>>>>>>>>> operations cost 0 now, instead of the previous cost of 2:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> +strided_store_1.c:38:151: note:    === vectorizable_operation ===
>>>>>>>>>>>>> +strided_store_1.c:38:151: note:    vect_model_simple_cost: 
>>>>>>>>>>>>> inside_cost = 1, prologue_cost = 0 .
>>>>>>>>>>>>> +strided_store_1.c:38:151: note:   ==> examining statement: *_6 = 
>>>>>>>>>>>>> _7;
>>>>>>>>>>>>> +strided_store_1.c:38:151: note:   vect_is_simple_use: operand _3 
>>>>>>>>>>>>> + 1.0e+0, type of def:    internal
>>>>>>>>>>>>> +strided_store_1.c:38:151: note:   Vectorizing an unaligned 
>>>>>>>>>>>>> access.
>>>>>>>>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128
>>>>>>>>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
>>>>>>>>>>>>> +strided_store_1.c:38:151: note:   vect_model_store_cost: 
>>>>>>>>>>>>> inside_cost = 12, prologue_cost = 0 .
>>>>>>>>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body
>>>>>>>>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
>>>>>>>>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
>>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>>>>> +<unknown> 1 times vector_load costs 1 in prologue
>>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in 
>>>>>>>>>>>>> multiple places in aarch64.cc, the location that causes this 
>>>>>>>>>>>>> behavior is this one:
>>>>>>>>>>>>> unsigned
>>>>>>>>>>>>> aarch64_vector_costs::add_stmt_cost (int count, 
>>>>>>>>>>>>> vect_cost_for_stmt kind,
>>>>>>>>>>>>>                             stmt_vec_info stmt_info, slp_tree,
>>>>>>>>>>>>>                             tree vectype, int misalign,
>>>>>>>>>>>>>                             vect_cost_model_location where)
>>>>>>>>>>>>> {
>>>>>>>>>>>>> [...]
>>>>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>>>>>>>> of just looking at KIND.  */
>>>>>>>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>>>>>>> +  if (stmt_info)
>>>>>>>>>>>>> {
>>>>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one
>>>>>>>>>>>>> vec_to_scalar for each element.  However, we can store the first
>>>>>>>>>>>>> element using an FP store without a separate extract step.  */
>>>>>>>>>>>>> if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>>>>>>>> count -= 1;
>>>>>>>>>>>>> 
>>>>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>>>>>>>>>>                                              stmt_info, 
>>>>>>>>>>>>> stmt_cost);
>>>>>>>>>>>>> 
>>>>>>>>>>>>> if (vectype && m_vec_flags)
>>>>>>>>>>>>> stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>>>>>>>>>>>>>                                                stmt_info, vectype,
>>>>>>>>>>>>>                                                where, stmt_cost);
>>>>>>>>>>>>> }
>>>>>>>>>>>>> [...]
>>>>>>>>>>>>> return record_stmt_cost (stmt_info, where, (count * 
>>>>>>>>>>>>> stmt_cost).ceil ());
>>>>>>>>>>>>> }
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Previously, for mtune=generic, this function returned a cost of 2 
>>>>>>>>>>>>> for a vec_to_scalar operation in the vect body. Now "if 
>>>>>>>>>>>>> (stmt_info)" is entered and "if (vect_is_store_elt_extraction 
>>>>>>>>>>>>> (kind, stmt_info))" evaluates to true, which sets the count to 0 
>>>>>>>>>>>>> and leads to a return value of 0.
>>>>>>>>>>>> 
>>>>>>>>>>>> At the time the code was written, a scalarised store would be 
>>>>>>>>>>>> costed
>>>>>>>>>>>> using one vec_to_scalar call into the backend, with the count 
>>>>>>>>>>>> parameter
>>>>>>>>>>>> set to the number of elements being stored.  The "count -= 1" was
>>>>>>>>>>>> supposed to lop off the leading element extraction, since we can 
>>>>>>>>>>>> store
>>>>>>>>>>>> lane 0 as a normal FP store.
>>>>>>>>>>>> 
>>>>>>>>>>>> The target-independent costing was later reworked so that it costs
>>>>>>>>>>>> each operation individually:
>>>>>>>>>>>> 
>>>>>>>>>>>>      for (i = 0; i < nstores; i++)
>>>>>>>>>>>>        {
>>>>>>>>>>>>          if (costing_p)
>>>>>>>>>>>>            {
>>>>>>>>>>>>              /* Only need vector extracting when there are more
>>>>>>>>>>>>                 than one stores.  */
>>>>>>>>>>>>              if (nstores > 1)
>>>>>>>>>>>>                inside_cost
>>>>>>>>>>>>                  += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>>>>>>>>>>>>                                       stmt_info, 0, vect_body);
>>>>>>>>>>>>              /* Take a single lane vector type store as scalar
>>>>>>>>>>>>                 store to avoid ICE like 110776.  */
>>>>>>>>>>>>              if (VECTOR_TYPE_P (ltype)
>>>>>>>>>>>>                  && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>>>>>>>>>>                n_adjacent_stores++;
>>>>>>>>>>>>              else
>>>>>>>>>>>>                inside_cost
>>>>>>>>>>>>                  += record_stmt_cost (cost_vec, 1, scalar_store,
>>>>>>>>>>>>                                       stmt_info, 0, vect_body);
>>>>>>>>>>>>              continue;
>>>>>>>>>>>>            }
>>>>>>>>>>>> 
>>>>>>>>>>>> Unfortunately, there's no easy way of telling whether a particular 
>>>>>>>>>>>> call
>>>>>>>>>>>> is part of a group, and if so, which member of the group it is.
>>>>>>>>>>>> 
>>>>>>>>>>>> I suppose we could give up on the attempt to be (somewhat) accurate
>>>>>>>>>>>> and just disable the optimisation.  Or we could restrict it to 
>>>>>>>>>>>> count > 1,
>>>>>>>>>>>> since it might still be useful for gathers and scatters.
>>>>>>>>>>> I tried restricting the calls to vect_is_store_elt_extraction to 
>>>>>>>>>>> count > 1 and it seems to resolve the issue of costing 
>>>>>>>>>>> vec_to_scalar operations with 0 (see patch below).
>>>>>>>>>>> What are your thoughts on this?
>>>>>>>>>> 
>>>>>>>>>> Why didn't you pursue instead moving the vec_to_scalar cost together
>>>>>>>>>> with the n_adjacent_store handling?
>>>>>>>>> When I continued working on this patch, we had already reached stage 
>>>>>>>>> 3 and I was hesitant to introduce changes to the middle-end that were 
>>>>>>>>> not previously covered by this patch. So I tried to see whether the
>>>>>>>>> issue could be resolved by making a small change in the backend.
>>>>>>>>> If you still advise using n_adjacent_stores instead, I’m happy to
>>>>>>>>> look into it again.
>>>>>>>> 
>>>>>>>> If Richard's ok with adjusting vectorizable_store for GCC 15 (which it
>>>>>>>> sounds like he is), then I agree that would be better.  Otherwise we'd
>>>>>>>> be creating technical debt to clean up for GCC 16.  And it is a 
>>>>>>>> regression
>>>>>>>> of sorts, so is stage 3 material from that POV.
>>>>>>>> 
>>>>>>>> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a
>>>>>>>> "let's clean this up next stage 1" thing, since we needed to add tuning
>>>>>>>> for a new CPU late during the cycle.  But of course, there were other
>>>>>>>> priorities when stage 1 actually came around, so it never actually
>>>>>>>> happened.  Thanks again for being the one to sort this out.)
>>>>>>> Thanks for your feedback. Then I will try to make it work in 
>>>>>>> vectorizable_store.
>>>>>>> Best,
>>>>>>> Jennifer
>>>>>> Below is the updated patch with a suggestion for the changes in 
>>>>>> vectorizable_store. It resolves the issue with the vec_to_scalar 
>>>>>> operations that were individually costed with 0.
>>>>>> We already tested it on aarch64, no regression, but we are still doing 
>>>>>> performance testing.
>>>>>> Can you give some feedback in the meantime on the patch itself?
>>>>>> Thanks,
>>>>>> Jennifer
>>>>>> 
>>>>>> 
>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable 
>>>>>> and
>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>>>> default. To that end, the function aarch64_use_new_vector_costs_p and 
>>>>>> its uses
>>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as
>>>>>> described in
>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>>>>> we adjusted vectorizable_store such that the variable n_adjacent_stores
>>>>>> also covers vec_to_scalar operations. This way vec_to_scalar operations
>>>>>> are not costed individually, but as a group.
>>>>>> 
>>>>>> Two tests were adjusted due to changes in codegen. In both cases, the
>>>>>> old code performed loop unrolling once, but the new code does not:
>>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>>>> -moverride=tune=none):
>>>>>> f_int64_t_32:
>>>>>>    cbz     w3, .L92
>>>>>>    mov     x4, 0
>>>>>>    uxtw    x3, w3
>>>>>> +       cntd    x5
>>>>>> +       whilelo p7.d, xzr, x3
>>>>>> +       mov     z29.s, w5
>>>>>>    mov     z31.s, w2
>>>>>> -       whilelo p6.d, xzr, x3
>>>>>> -       mov     x2, x3
>>>>>> -       index   z30.s, #0, #1
>>>>>> -       uqdecd  x2
>>>>>> -       ptrue   p5.b, all
>>>>>> -       whilelo p7.d, xzr, x2
>>>>>> +       index   z30.d, #0, #1
>>>>>> +       ptrue   p6.b, all
>>>>>>    .p2align 3,,7
>>>>>> .L94:
>>>>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
>>>>>> -       ld1d    z28.d, p6/z, [x0]
>>>>>> -       movprfx z29, z31
>>>>>> -       mul     z29.s, p5/m, z29.s, z30.s
>>>>>> -       incw    x4
>>>>>> -       uunpklo z0.d, z29.s
>>>>>> -       uunpkhi z29.d, z29.s
>>>>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
>>>>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
>>>>>> -       add     z25.d, z28.d, z25.d
>>>>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
>>>>>> +       movprfx z28, z31
>>>>>> +       mul     z28.s, p6/m, z28.s, z30.s
>>>>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>>>>>>    add     z26.d, z27.d, z26.d
>>>>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
>>>>>> -       whilelo p7.d, x4, x2
>>>>>> -       st1d    z25.d, p6, [x0]
>>>>>> -       incw    z30.s
>>>>>> -       incb    x0, all, mul #2
>>>>>> -       whilelo p6.d, x4, x3
>>>>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
>>>>>> +       add     z30.s, z30.s, z29.s
>>>>>> +       incd    x4
>>>>>> +       whilelo p7.d, x4, x3
>>>>>>    b.any   .L94
>>>>>> .L92:
>>>>>>    ret
>>>>>> 
>>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>>>> -moverride=tune=none):
>>>>>> f_int64_t_32:
>>>>>>    cbz     w3, .L84
>>>>>> -       addvl   x5, x1, #1
>>>>>>    mov     x4, 0
>>>>>>    uxtw    x3, w3
>>>>>> -       mov     z31.s, w2
>>>>>> +       cntd    x5
>>>>>>    whilelo p7.d, xzr, x3
>>>>>> -       mov     x2, x3
>>>>>> -       index   z30.s, #0, #1
>>>>>> -       uqdecd  x2
>>>>>> -       ptrue   p5.b, all
>>>>>> -       whilelo p6.d, xzr, x2
>>>>>> +       mov     z29.s, w5
>>>>>> +       mov     z31.s, w2
>>>>>> +       index   z30.d, #0, #1
>>>>>> +       ptrue   p6.b, all
>>>>>>    .p2align 3,,7
>>>>>> .L86:
>>>>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
>>>>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
>>>>>> -       movprfx z29, z30
>>>>>> -       mul     z29.s, p5/m, z29.s, z31.s
>>>>>> -       add     z28.d, z28.d, #1
>>>>>> -       uunpklo z26.d, z29.s
>>>>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
>>>>>> -       incw    x4
>>>>>> -       uunpkhi z29.d, z29.s
>>>>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
>>>>>> +       movprfx z28, z30
>>>>>> +       mul     z28.s, p6/m, z28.s, z31.s
>>>>>>    add     z27.d, z27.d, #1
>>>>>> -       whilelo p6.d, x4, x2
>>>>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
>>>>>> -       incw    z30.s
>>>>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
>>>>>> +       incd    x4
>>>>>> +       add     z30.s, z30.s, z29.s
>>>>>>    whilelo p7.d, x4, x3
>>>>>>    b.any   .L86
>>>>>> .L84:
>>>>>>    ret
>>>>>> 
>>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
>>>>>> regression.
>>>>>> OK for mainline?
>>>>>> 
>>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>>>> 
>>>>>> gcc/
>>>>>>    * tree-vect-stmts.cc (vectorizable_store): Extend the use of
>>>>>>    n_adjacent_stores to also cover vec_to_scalar operations.
>>>>>>    * config/aarch64/aarch64-tuning-flags.def: Remove
>>>>>>    use_new_vector_costs as tuning option.
>>>>>>    * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>>>>>    Remove.
>>>>>>    (aarch64_vector_costs::add_stmt_cost): Remove use of
>>>>>>    aarch64_use_new_vector_costs_p.
>>>>>>    (aarch64_vector_costs::finish_cost): Remove use of
>>>>>>    aarch64_use_new_vector_costs_p.
>>>>>>    * config/aarch64/tuning_models/cortexx925.h: Remove
>>>>>>    AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>>>>>    * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>>>>>    * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>>>>>    * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>>>>>    * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>>>>>    * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>>>>>    * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>>>>>    * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>>>>>    * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>>>>>    * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>>>>>    * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>>>>> 
>>>>>> gcc/testsuite/
>>>>>>    * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>>>>>    * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>>>>> ---
>>>>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
>>>>>> gcc/config/aarch64/aarch64.cc                 | 20 +++----------
>>>>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>>>>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>>>>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>>>>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>>>>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>>>>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>>>>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>>>>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>>>>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>>>>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>>>>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>>>>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>>>>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>>>>>> gcc/tree-vect-stmts.cc                        | 29 ++++++++++---------
>>>>>> 16 files changed, 22 insertions(+), 44 deletions(-)
>>>>>> 
>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>> index ffbff20e29c..1de633c739b 100644
>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
>>>>>> CHEAP_SHIFT_EXTEND)
>>>>>> 
>>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", 
>>>>>> CSE_SVE_VL_CONSTANTS)
>>>>>> 
>>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", 
>>>>>> USE_NEW_VECTOR_COSTS)
>>>>>> -
>>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
>>>>>> MATCHED_VECTOR_THROUGHPUT)
>>>>>> 
>>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
>>>>>> AVOID_CROSS_LOOP_FMA)
>>>>>> diff --git a/gcc/config/aarch64/aarch64.cc 
>>>>>> b/gcc/config/aarch64/aarch64.cc
>>>>>> index 77a2a6bfa3a..71fba9cc63b 100644
>>>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info 
>>>>>> *vinfo, bool costing_for_scalar)
>>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>>>>> }
>>>>>> 
>>>>>> -/* Return true if the current CPU should use the new costs defined
>>>>>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
>>>>>> -   costs applying to all CPUs instead.  */
>>>>>> -static bool
>>>>>> -aarch64_use_new_vector_costs_p ()
>>>>>> -{
>>>>>> -  return (aarch64_tune_params.extra_tuning_flags
>>>>>> -         & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>>>>> -}
>>>>>> -
>>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
>>>>>> static const simd_vec_cost *
>>>>>> aarch64_simd_vec_costs (tree vectype)
>>>>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>>>> vect_cost_for_stmt kind,
>>>>>> 
>>>>>> /* Do one-time initialization based on the vinfo.  */
>>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>>>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>>>>> +  if (!m_analyzed_vinfo)
>>>>>> {
>>>>>>   if (loop_vinfo)
>>>>>>    analyze_loop_vinfo (loop_vinfo);
>>>>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>>>> vect_cost_for_stmt kind,
>>>>>> 
>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>  of just looking at KIND.  */
>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>> +  if (stmt_info)
>>>>>> {
>>>>>>   /* If we scalarize a strided store, the vectorizer costs one
>>>>>>     vec_to_scalar for each element.  However, we can store the first
>>>>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>>>> vect_cost_for_stmt kind,
>>>>>> else
>>>>>> m_num_last_promote_demote = 0;
>>>>>> 
>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>> +  if (stmt_info)
>>>>>> {
>>>>>>   /* Account for any extra "embedded" costs that apply additively
>>>>>>     to the base cost calculated above.  */
>>>>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const 
>>>>>> vector_costs *uncast_scalar_costs)
>>>>>> 
>>>>>> auto *scalar_costs
>>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>>>>> -  if (loop_vinfo
>>>>>> -      && m_vec_flags
>>>>>> -      && aarch64_use_new_vector_costs_p ())
>>>>>> +  if (loop_vinfo && m_vec_flags)
>>>>>> {
>>>>>>   m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>>>>                                         m_costs[vect_body]);
>>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
>>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>> index b2ff716157a..0a8eff69307 100644
>>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings =
>>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>>>> &generic_prefetch_tune,
>>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
>>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>> index 2d704ecd110..a564528f43d 100644
>>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings 
>>>>>> =
>>>>>> 0,   /* max_case_values.  */
>>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>>>> &generic_prefetch_tune,
>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>> index bdd309ab03d..f090d5cde50 100644
>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>> @@ -183,7 +183,6 @@ static const struct tune_params 
>>>>>> generic_armv8_a_tunings =
>>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>>>> &generic_prefetch_tune,
>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>> index a05a9ab92a2..4c33c147444 100644
>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>> @@ -249,7 +249,6 @@ static const struct tune_params 
>>>>>> generic_armv9_a_tunings =
>>>>>> 0,   /* max_case_values.  */
>>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>>>> &generic_armv9a_prefetch_tune,
>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
>>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>> index c407b89a22f..fe4f7c10f73 100644
>>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>> @@ -156,7 +156,6 @@ static const struct tune_params 
>>>>>> neoverse512tvb_tunings =
>>>>>> 0,   /* max_case_values.  */
>>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>>>> &generic_prefetch_tune,
>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
>>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>> index fd5f8f37370..0c74068da2c 100644
>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
>>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>>>> &generic_prefetch_tune,
>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
>>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>> index 8b156c2fe4d..9d4e1be171a 100644
>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
>>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>>>> &generic_prefetch_tune,
>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
>>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>> index 23c121d8652..85a78bb2bef 100644
>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
>>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>>>> &generic_prefetch_tune,
>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>> index 40af5f47f4f..1dd452beb8d 100644
>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings =
>>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),  /* tune_flags.  */
>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>> index d65d74bfecf..d0ba5b1aef6 100644
>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
>>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>>>> &generic_prefetch_tune,
>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>> index 7b7fa0b4b08..a1572048503 100644
>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings 
>>>>>> =
>>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>>>> &generic_prefetch_tune,
>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>> index 762805ff54b..c334b7a6875 100644
>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>> @@ -15,4 +15,4 @@
>>>>>> so we vectorize the offset calculation.  This means that the
>>>>>> 64-bit version needs two copies.  */
>>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>> index f0ea58e38e2..94cc63049bc 100644
>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>> @@ -15,4 +15,4 @@
>>>>>> so we vectorize the offset calculation.  This means that the
>>>>>> 64-bit version needs two copies.  */
>>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>>>>>> index be1139a423c..6d7d28c4702 100644
>>>>>> --- a/gcc/tree-vect-stmts.cc
>>>>>> +++ b/gcc/tree-vect-stmts.cc
>>>>>> @@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo,
>>>>>>            {
>>>>>>              if (costing_p)
>>>>>>                {
>>>>>> -                     /* Only need vector extracting when there are more
>>>>>> -                        than one stores.  */
>>>>>> -                     if (nstores > 1)
>>>>>> -                       inside_cost
>>>>>> -                         += record_stmt_cost (cost_vec, 1, 
>>>>>> vec_to_scalar,
>>>>>> -                                              stmt_info, slp_node,
>>>>>> -                                              0, vect_body);
>>>>>>                  /* Take a single lane vector type store as scalar
>>>>>>                     store to avoid ICE like 110776.  */
>>>>>> -                     if (VECTOR_TYPE_P (ltype)
>>>>>> -                         && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>>>> +                     bool single_lane_vec_p =
>>>>>> +                       VECTOR_TYPE_P (ltype)
>>>>>> +                       && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U);
>>>>>> +                     /* Only need vector extracting when there are more
>>>>>> +                        than one stores.  */
>>>>>> +                     if (nstores > 1 || single_lane_vec_p)
>>>>>>                    n_adjacent_stores++;
>>>>>> -                     else
>>>>>> +                     if (!single_lane_vec_p)
>>>>> 
>>>>> I think it's somewhat non-obvious that nstores > 1 and single_lane_vec_p
>>>>> correlate.  In fact I think that we always record a store, just for
>>>>> single-element
>>>>> vectors we record scalar stores.  I suggest here to always just do
>>>>> n_adjacent_stores++
>>>>> and below ...
>>>>> 
>>>>>>                    inside_cost
>>>>>>                      += record_stmt_cost (cost_vec, 1, scalar_store,
>>>>>>                                           stmt_info, 0, vect_body);
>>>>>> @@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo,
>>>>>>   if (costing_p)
>>>>>>    {
>>>>>>      if (n_adjacent_stores > 0)
>>>>>> -           vect_get_store_cost (vinfo, stmt_info, slp_node, 
>>>>>> n_adjacent_stores,
>>>>>> -                                alignment_support_scheme, misalignment,
>>>>>> -                                &inside_cost, cost_vec);
>>>>>> +           {
>>>>>> +             vect_get_store_cost (vinfo, stmt_info, slp_node, 
>>>>>> n_adjacent_stores,
>>>>>> +                                  alignment_support_scheme, 
>>>>>> misalignment,
>>>>>> +                                  &inside_cost, cost_vec);
>>>>> 
>>>>> ... record n_adjacent_stores scalar_store when ltype is single-lane and 
>>>>> record
>>>>> n_adjacent_stores vec_to_scalar if nstores > 1 (and else none).
>>>>> 
>>>>> Richard.
>>>> Thanks for the feedback, I’m glad it’s going in the right direction. Below 
>>>> is the updated patch, re-validated on aarch64.
>>>> Thanks, Jennifer
>>>> 
>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>> default. To that end, the function aarch64_use_new_vector_costs_p and its 
>>>> uses
>>>> were removed. To prevent costing vec_to_scalar operations with 0, as
>>>> described in
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>>> we adjusted vectorizable_store such that the variable n_adjacent_stores
>>>> also covers vec_to_scalar operations. This way vec_to_scalar operations
>>>> are not costed individually, but as a group.
>>>> 
>>>> Two tests were adjusted due to changes in codegen. In both cases, the
>>>> old code performed loop unrolling once, but the new code does not:
>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>> -moverride=tune=none):
>>>> f_int64_t_32:
>>>>     cbz     w3, .L92
>>>>     mov     x4, 0
>>>>     uxtw    x3, w3
>>>> +       cntd    x5
>>>> +       whilelo p7.d, xzr, x3
>>>> +       mov     z29.s, w5
>>>>     mov     z31.s, w2
>>>> -       whilelo p6.d, xzr, x3
>>>> -       mov     x2, x3
>>>> -       index   z30.s, #0, #1
>>>> -       uqdecd  x2
>>>> -       ptrue   p5.b, all
>>>> -       whilelo p7.d, xzr, x2
>>>> +       index   z30.d, #0, #1
>>>> +       ptrue   p6.b, all
>>>>     .p2align 3,,7
>>>> .L94:
>>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
>>>> -       ld1d    z28.d, p6/z, [x0]
>>>> -       movprfx z29, z31
>>>> -       mul     z29.s, p5/m, z29.s, z30.s
>>>> -       incw    x4
>>>> -       uunpklo z0.d, z29.s
>>>> -       uunpkhi z29.d, z29.s
>>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
>>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
>>>> -       add     z25.d, z28.d, z25.d
>>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
>>>> +       movprfx z28, z31
>>>> +       mul     z28.s, p6/m, z28.s, z30.s
>>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>>>>     add     z26.d, z27.d, z26.d
>>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
>>>> -       whilelo p7.d, x4, x2
>>>> -       st1d    z25.d, p6, [x0]
>>>> -       incw    z30.s
>>>> -       incb    x0, all, mul #2
>>>> -       whilelo p6.d, x4, x3
>>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
>>>> +       add     z30.s, z30.s, z29.s
>>>> +       incd    x4
>>>> +       whilelo p7.d, x4, x3
>>>>     b.any   .L94
>>>> .L92:
>>>>     ret
>>>> 
>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>> -moverride=tune=none):
>>>> f_int64_t_32:
>>>>     cbz     w3, .L84
>>>> -       addvl   x5, x1, #1
>>>>     mov     x4, 0
>>>>     uxtw    x3, w3
>>>> -       mov     z31.s, w2
>>>> +       cntd    x5
>>>>     whilelo p7.d, xzr, x3
>>>> -       mov     x2, x3
>>>> -       index   z30.s, #0, #1
>>>> -       uqdecd  x2
>>>> -       ptrue   p5.b, all
>>>> -       whilelo p6.d, xzr, x2
>>>> +       mov     z29.s, w5
>>>> +       mov     z31.s, w2
>>>> +       index   z30.d, #0, #1
>>>> +       ptrue   p6.b, all
>>>>     .p2align 3,,7
>>>> .L86:
>>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
>>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
>>>> -       movprfx z29, z30
>>>> -       mul     z29.s, p5/m, z29.s, z31.s
>>>> -       add     z28.d, z28.d, #1
>>>> -       uunpklo z26.d, z29.s
>>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
>>>> -       incw    x4
>>>> -       uunpkhi z29.d, z29.s
>>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
>>>> +       movprfx z28, z30
>>>> +       mul     z28.s, p6/m, z28.s, z31.s
>>>>     add     z27.d, z27.d, #1
>>>> -       whilelo p6.d, x4, x2
>>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
>>>> -       incw    z30.s
>>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
>>>> +       incd    x4
>>>> +       add     z30.s, z30.s, z29.s
>>>>     whilelo p7.d, x4, x3
>>>>     b.any   .L86
>>>> .L84:
>>>> ret
>>>> 
>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
>>>> regression.
>>>> OK for mainline?
>>>> 
>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>> 
>>>> gcc/
>>>> * tree-vect-stmts.cc (vectorizable_store): Extend the use of
>>>> n_adjacent_stores to also cover vec_to_scalar operations.
>>>> * config/aarch64/aarch64-tuning-flags.def: Remove
>>>> use_new_vector_costs as tuning option.
>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>>> Remove.
>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of
>>>> aarch64_use_new_vector_costs_p.
>>>> (aarch64_vector_costs::finish_cost): Remove use of
>>>> aarch64_use_new_vector_costs_p.
>>>> * config/aarch64/tuning_models/cortexx925.h: Remove
>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>>> 
>>>> gcc/testsuite/
>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>>> ---
>>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 -
>>>> gcc/config/aarch64/aarch64.cc                 | 20 ++--------
>>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>>>> gcc/tree-vect-stmts.cc                        | 37 +++++++++++--------
>>>> 16 files changed, 27 insertions(+), 47 deletions(-)
>>>> 
>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> index ffbff20e29c..1de633c739b 100644
>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
>>>> CHEAP_SHIFT_EXTEND)
>>>> 
>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>>>> 
>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
>>>> -
>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
>>>> MATCHED_VECTOR_THROUGHPUT)
>>>> 
>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
>>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>>> index 77a2a6bfa3a..71fba9cc63b 100644
>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, 
>>>> bool costing_for_scalar)
>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>>> }
>>>> 
>>>> -/* Return true if the current CPU should use the new costs defined
>>>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
>>>> -   costs applying to all CPUs instead.  */
>>>> -static bool
>>>> -aarch64_use_new_vector_costs_p ()
>>>> -{
>>>> -  return (aarch64_tune_params.extra_tuning_flags
>>>> -      & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>>> -}
>>>> -
>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */
>>>> static const simd_vec_cost *
>>>> aarch64_simd_vec_costs (tree vectype)
>>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>> vect_cost_for_stmt kind,
>>>> 
>>>> /* Do one-time initialization based on the vinfo.  */
>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>>> +  if (!m_analyzed_vinfo)
>>>>  {
>>>>    if (loop_vinfo)
>>>> analyze_loop_vinfo (loop_vinfo);
>>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>> vect_cost_for_stmt kind,
>>>> 
>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>   of just looking at KIND.  */
>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>> +  if (stmt_info)
>>>>  {
>>>>    /* If we scalarize a strided store, the vectorizer costs one
>>>>  vec_to_scalar for each element.  However, we can store the first
>>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>> vect_cost_for_stmt kind,
>>>> else
>>>>  m_num_last_promote_demote = 0;
>>>> 
>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>> +  if (stmt_info)
>>>>  {
>>>>    /* Account for any extra "embedded" costs that apply additively
>>>>  to the base cost calculated above.  */
>>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const 
>>>> vector_costs *uncast_scalar_costs)
>>>> 
>>>> auto *scalar_costs
>>>>  = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>>> -  if (loop_vinfo
>>>> -      && m_vec_flags
>>>> -      && aarch64_use_new_vector_costs_p ())
>>>> +  if (loop_vinfo && m_vec_flags)
>>>>  {
>>>>    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>>                      m_costs[vect_body]);
>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>> index 5ebaf66e986..74772f3e15f 100644
>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>> @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>>> &generic_armv9a_prefetch_tune,
>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>> index 2d704ecd110..a564528f43d 100644
>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
>>>> 0,    /* max_case_values.  */
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags. */
>>>> &generic_prefetch_tune,
>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>> index bdd309ab03d..f090d5cde50 100644
>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>> @@ -183,7 +183,6 @@ static const struct tune_params 
>>>> generic_armv8_a_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags. */
>>>> &generic_prefetch_tune,
>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>> index 785e00946bc..7b5821183bc 100644
>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>> @@ -251,7 +251,6 @@ static const struct tune_params 
>>>> generic_armv9_a_tunings =
>>>> 0,    /* max_case_values.  */
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags. */
>>>> &generic_armv9a_prefetch_tune,
>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>> index 007f987154c..f7457df59e5 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings 
>>>> =
>>>> 0,    /* max_case_values.  */
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags. */
>>>> &generic_armv9a_prefetch_tune,
>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>> index 32560d2f5f8..541b61c8179 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>>> &generic_armv9a_prefetch_tune,
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>> index 2010bc4645b..eff668132a8 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags. */
>>>> &generic_armv9a_prefetch_tune,
>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>> index c3751e32696..d11472b6e1e 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>>> &generic_armv9a_prefetch_tune,
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>> index 80dbe5c806c..ee77ffdd3bc 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),    /* tune_flags.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>> index efe09e16d1e..6ef143ef7d5 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>>> &generic_armv9a_prefetch_tune,
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>> index 66849f30889..96bdbf971f1 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>>> &generic_armv9a_prefetch_tune,
>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>> index 762805ff54b..c334b7a6875 100644
>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>> @@ -15,4 +15,4 @@
>>>> so we vectorize the offset calculation.  This means that the
>>>> 64-bit version needs two copies.  */
>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>> index f0ea58e38e2..94cc63049bc 100644
>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>> @@ -15,4 +15,4 @@
>>>> so we vectorize the offset calculation.  This means that the
>>>> 64-bit version needs two copies.  */
>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>>>> index be1139a423c..ab57163c243 100644
>>>> --- a/gcc/tree-vect-stmts.cc
>>>> +++ b/gcc/tree-vect-stmts.cc
>>>> @@ -8834,19 +8834,8 @@ vectorizable_store (vec_info *vinfo,
>>>>     {
>>>>       if (costing_p)
>>>>         {
>>>> -              /* Only need vector extracting when there are more
>>>> -             than one stores.  */
>>>> -              if (nstores > 1)
>>>> -            inside_cost
>>>> -              += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>>>> -                           stmt_info, slp_node,
>>>> -                           0, vect_body);
>>>> -              /* Take a single lane vector type store as scalar
>>>> -             store to avoid ICE like 110776.  */
>>>> -              if (VECTOR_TYPE_P (ltype)
>>>> -              && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>> -            n_adjacent_stores++;
>>>> -              else
>>>> +              n_adjacent_stores++;
>>>> +              if (!VECTOR_TYPE_P (ltype))
>>> 
>>> This should be combined with the single-lane vector case below.
>>> 
>>>>         inside_cost
>>>>           += record_stmt_cost (cost_vec, 1, scalar_store,
>>>>                        stmt_info, 0, vect_body);
>>>> @@ -8905,9 +8894,25 @@ vectorizable_store (vec_info *vinfo,
>>>>    if (costing_p)
>>>> {
>>>>   if (n_adjacent_stores > 0)
>>>> -        vect_get_store_cost (vinfo, stmt_info, slp_node, 
>>>> n_adjacent_stores,
>>>> -                 alignment_support_scheme, misalignment,
>>>> -                 &inside_cost, cost_vec);
>>>> +        {
>>>> +          /* Take a single lane vector type store as scalar
>>>> +         store to avoid ICE like 110776.  */
>>>> +          if (VECTOR_TYPE_P (ltype)
>>>> +          && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>> +        inside_cost
>>>> +          += record_stmt_cost (cost_vec, n_adjacent_stores,
>>>> +                       scalar_store, stmt_info, 0, vect_body);
>>>> +          /* Only need vector extracting when there are more
>>>> +         than one stores.  */
>>>> +          if (nstores > 1)
>>>> +        inside_cost
>>>> +          += record_stmt_cost (cost_vec, n_adjacent_stores,
>>>> +                       vec_to_scalar, stmt_info, slp_node,
>>>> +                       0, vect_body);
>>>> +          vect_get_store_cost (vinfo, stmt_info, slp_node,
>>> 
>>> This should be only done for multi-lane vectors.
>> Thanks for the quick reply. As I am making the changes, I am wondering: Do 
>> we even need n_adjacent_stores anymore? It appears to always have the same 
>> value as nstores. Can we remove it and use nstores instead or does it still 
>> serve another purpose?
> 
> It was a heuristic needed for powerpc(?), can you confirm we’re not combining 
> stores from VF unrolling for strided SLP stores?
Hi Richard,
The reasoning behind my suggestion to replace n_adjacent_stores by nstores in
this code section is that with my patch they will logically always have the 
same value.

Having said that, I looked into why n_adjacent_stores was introduced in the 
first place: The patch [1] that introduced n_adjacent_stores fixed a regression 
on aarch64 by costing vector loads/stores together. The variables 
n_adjacent_stores and n_adjacent_loads were added in two code sections each in 
vectorizable_store and vectorizable_load. The connection to PowerPC you 
recalled is also mentioned in the PR, but I believe it refers to the enum 
dr_alignment_support alignment_support_scheme that is used in 

vect_get_store_cost (vinfo, stmt_info, slp_node,
                     n_adjacent_stores, alignment_support_scheme,
                     misalignment, &inside_cost, cost_vec);

to which I made no changes other than refactoring the if-statement around it.

So, given that n_adjacent_stores was introduced in multiple locations, I would
rather leave it in the code section I changed, in order to keep
vectorizable_store and vectorizable_load consistent.

Regarding your question about not combining stores from VF unrolling for
strided SLP stores: I'm not entirely sure what you mean, but could it be
covered by the gcc.target/aarch64/ldp_stp_* tests that are also mentioned in
[1]?
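
For reference, my understanding of what those tests exercise (an illustrative
example of my own, not taken from the testsuite): adjacent scalar stores like
the ones below, which the aarch64 backend is normally expected to combine into
a single stp at -O2.

/* Illustrative only: two adjacent 64-bit stores that are normally
   merged into one "stp a, b, [p]" instruction on aarch64.  */
void
store_pair (long long *p, long long a, long long b)
{
  p[0] = a;
  p[1] = b;
}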

I added the changes you proposed in the updated patch below, but kept 
n_adjacent_stores. The patch was re-validated on aarch64.
Thanks,
Jennifer

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111784#c3


This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
default. To that end, the function aarch64_use_new_vector_costs_p and its uses
were removed. To prevent costing vec_to_scalar operations with 0, as
described in
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
we adjusted vectorizable_store such that the variable n_adjacent_stores
also covers vec_to_scalar operations. This way vec_to_scalar operations
are not costed individually, but as a group.
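
To illustrate the difference with a minimal stand-alone toy (not GCC's real
cost-model API, just the idea): if the backend deducts one "free" element
extraction per costing call, then a single call covering the whole group
deducts it only once, whereas one call per element would deduct it every time
and make all vec_to_scalar operations cost 0.

/* Toy model only: grouped vs. per-element costing of the element
   extractions of a scalarized store, assuming the back end treats the
   first extraction of each costing call as free.  */
#include <iostream>

const int vec_to_scalar_cost = 2;   /* hypothetical per-element cost  */

static int
cost_extractions (int count)
{
  if (count > 0)
    count -= 1;                     /* first extraction assumed free  */
  return count * vec_to_scalar_cost;
}

int
main ()
{
  const int n_adjacent_stores = 4;

  /* Grouped: one call for the whole group -> (4 - 1) * 2 = 6.  */
  int grouped = cost_extractions (n_adjacent_stores);

  /* Per element: every call gets the "first one is free" discount,
     so the total degenerates to 0.  */
  int per_element = 0;
  for (int i = 0; i < n_adjacent_stores; i++)
    per_element += cost_extractions (1);

  std::cout << "grouped: " << grouped
            << ", per element: " << per_element << "\n";
  return 0;
}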

Two tests were adjusted due to changes in codegen. In both cases, the
old code performed loop unrolling once, but the new code does not:
Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
-O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none):
f_int64_t_32:
        cbz     w3, .L92
        mov     x4, 0
        uxtw    x3, w3
+       cntd    x5
+       whilelo p7.d, xzr, x3
+       mov     z29.s, w5
        mov     z31.s, w2
-       whilelo p6.d, xzr, x3
-       mov     x2, x3
-       index   z30.s, #0, #1
-       uqdecd  x2
-       ptrue   p5.b, all
-       whilelo p7.d, xzr, x2
+       index   z30.d, #0, #1
+       ptrue   p6.b, all
        .p2align 3,,7
 .L94:
-       ld1d    z27.d, p7/z, [x0, #1, mul vl]
-       ld1d    z28.d, p6/z, [x0]
-       movprfx z29, z31
-       mul     z29.s, p5/m, z29.s, z30.s
-       incw    x4
-       uunpklo z0.d, z29.s
-       uunpkhi z29.d, z29.s
-       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
-       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
-       add     z25.d, z28.d, z25.d
+       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
+       movprfx z28, z31
+       mul     z28.s, p6/m, z28.s, z30.s
+       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
        add     z26.d, z27.d, z26.d
-       st1d    z26.d, p7, [x0, #1, mul vl]
-       whilelo p7.d, x4, x2
-       st1d    z25.d, p6, [x0]
-       incw    z30.s
-       incb    x0, all, mul #2
-       whilelo p6.d, x4, x3
+       st1d    z26.d, p7, [x0, x4, lsl 3]
+       add     z30.s, z30.s, z29.s
+       incd    x4
+       whilelo p7.d, x4, x3
        b.any   .L94
 .L92:
        ret

Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
-O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none):
f_int64_t_32:
        cbz     w3, .L84
-       addvl   x5, x1, #1
        mov     x4, 0
        uxtw    x3, w3
-       mov     z31.s, w2
+       cntd    x5
        whilelo p7.d, xzr, x3
-       mov     x2, x3
-       index   z30.s, #0, #1
-       uqdecd  x2
-       ptrue   p5.b, all
-       whilelo p6.d, xzr, x2
+       mov     z29.s, w5
+       mov     z31.s, w2
+       index   z30.d, #0, #1
+       ptrue   p6.b, all
        .p2align 3,,7
 .L86:
-       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
-       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
-       movprfx z29, z30
-       mul     z29.s, p5/m, z29.s, z31.s
-       add     z28.d, z28.d, #1
-       uunpklo z26.d, z29.s
-       st1d    z28.d, p7, [x0, z26.d, lsl 3]
-       incw    x4
-       uunpkhi z29.d, z29.s
+       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
+       movprfx z28, z30
+       mul     z28.s, p6/m, z28.s, z31.s
        add     z27.d, z27.d, #1
-       whilelo p6.d, x4, x2
-       st1d    z27.d, p7, [x0, z29.d, lsl 3]
-       incw    z30.s
+       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
+       incd    x4
+       add     z30.s, z30.s, z29.s
        whilelo p7.d, x4, x3
        b.any   .L86
 .L84:
        ret

The patch was bootstrapped and tested on aarch64-linux-gnu, no
regression.
OK for mainline?

Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>

gcc/
        * tree-vect-stmts.cc (vectorizable_store): Extend the use of
        n_adjacent_stores to also cover vec_to_scalar operations.
        * config/aarch64/aarch64-tuning-flags.def: Remove
        use_new_vector_costs as tuning option.
        * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
        Remove.
        (aarch64_vector_costs::add_stmt_cost): Remove use of
        aarch64_use_new_vector_costs_p.
        (aarch64_vector_costs::finish_cost): Remove use of
        aarch64_use_new_vector_costs_p.
        * config/aarch64/tuning_models/cortexx925.h: Remove
        AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
        * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
        * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
        * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
        * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
        * config/aarch64/tuning_models/neoversen2.h: Likewise.
        * config/aarch64/tuning_models/neoversen3.h: Likewise.
        * config/aarch64/tuning_models/neoversev1.h: Likewise.
        * config/aarch64/tuning_models/neoversev2.h: Likewise.
        * config/aarch64/tuning_models/neoversev3.h: Likewise.
        * config/aarch64/tuning_models/neoversev3ae.h: Likewise.

gcc/testsuite/
        * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
        * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
---
 gcc/config/aarch64/aarch64-tuning-flags.def   |  2 -
 gcc/config/aarch64/aarch64.cc                 | 20 ++--------
 gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
 .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
 .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
 .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
 .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
 gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
 gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
 gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
 gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
 gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
 .../aarch64/tuning_models/neoversev3ae.h      |  1 -
 .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
 .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
 gcc/tree-vect-stmts.cc                        | 40 ++++++++++---------
 16 files changed, 27 insertions(+), 50 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index ffbff20e29c..1de633c739b 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
 
 AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
 
-AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
-
 AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", MATCHED_VECTOR_THROUGHPUT)
 
 AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 77a2a6bfa3a..71fba9cc63b 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar)
   return new aarch64_vector_costs (vinfo, costing_for_scalar);
 }
 
-/* Return true if the current CPU should use the new costs defined
-   in GCC 11.  This should be removed for GCC 12 and above, with the
-   costs applying to all CPUs instead.  */
-static bool
-aarch64_use_new_vector_costs_p ()
-{
-  return (aarch64_tune_params.extra_tuning_flags
-         & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
-}
-
 /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
 static const simd_vec_cost *
 aarch64_simd_vec_costs (tree vectype)
@@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 
   /* Do one-time initialization based on the vinfo.  */
   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
-  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
+  if (!m_analyzed_vinfo)
     {
       if (loop_vinfo)
        analyze_loop_vinfo (loop_vinfo);
@@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 
   /* Try to get a more accurate cost by looking at STMT_INFO instead
      of just looking at KIND.  */
-  if (stmt_info && aarch64_use_new_vector_costs_p ())
+  if (stmt_info)
     {
       /* If we scalarize a strided store, the vectorizer costs one
         vec_to_scalar for each element.  However, we can store the first
@@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
   else
     m_num_last_promote_demote = 0;
 
-  if (stmt_info && aarch64_use_new_vector_costs_p ())
+  if (stmt_info)
     {
       /* Account for any extra "embedded" costs that apply additively
         to the base cost calculated above.  */
@@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
 
   auto *scalar_costs
     = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
-  if (loop_vinfo
-      && m_vec_flags
-      && aarch64_use_new_vector_costs_p ())
+  if (loop_vinfo && m_vec_flags)
     {
       m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
                                             m_costs[vect_body]);
diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h
index 5ebaf66e986..74772f3e15f 100644
--- a/gcc/config/aarch64/tuning_models/cortexx925.h
+++ b/gcc/config/aarch64/tuning_models/cortexx925.h
@@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
   &generic_armv9a_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
index 2d704ecd110..a564528f43d 100644
--- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
+++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
@@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
   0,   /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
   &generic_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
index bdd309ab03d..f090d5cde50 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
@@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
   &generic_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
index 785e00946bc..7b5821183bc 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
@@ -251,7 +251,6 @@ static const struct tune_params generic_armv9_a_tunings =
   0,   /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
   &generic_armv9a_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
index 007f987154c..f7457df59e5 100644
--- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
+++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
@@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings =
   0,   /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
   &generic_armv9a_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h
index 32560d2f5f8..541b61c8179 100644
--- a/gcc/config/aarch64/tuning_models/neoversen2.h
+++ b/gcc/config/aarch64/tuning_models/neoversen2.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
   &generic_armv9a_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h b/gcc/config/aarch64/tuning_models/neoversen3.h
index 2010bc4645b..eff668132a8 100644
--- a/gcc/config/aarch64/tuning_models/neoversen3.h
+++ b/gcc/config/aarch64/tuning_models/neoversen3.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
   &generic_armv9a_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h b/gcc/config/aarch64/tuning_models/neoversev1.h
index c3751e32696..d11472b6e1e 100644
--- a/gcc/config/aarch64/tuning_models/neoversev1.h
+++ b/gcc/config/aarch64/tuning_models/neoversev1.h
@@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
   &generic_armv9a_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h b/gcc/config/aarch64/tuning_models/neoversev2.h
index 80dbe5c806c..ee77ffdd3bc 100644
--- a/gcc/config/aarch64/tuning_models/neoversev2.h
+++ b/gcc/config/aarch64/tuning_models/neoversev2.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
    | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),  /* tune_flags.  */
diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h b/gcc/config/aarch64/tuning_models/neoversev3.h
index efe09e16d1e..6ef143ef7d5 100644
--- a/gcc/config/aarch64/tuning_models/neoversev3.h
+++ b/gcc/config/aarch64/tuning_models/neoversev3.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
   &generic_armv9a_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h b/gcc/config/aarch64/tuning_models/neoversev3ae.h
index 66849f30889..96bdbf971f1 100644
--- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
+++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
   &generic_armv9a_prefetch_tune,
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
index 762805ff54b..c334b7a6875 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
@@ -15,4 +15,4 @@
    so we vectorize the offset calculation.  This means that the
    64-bit version needs two copies.  */
 /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
-/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
index f0ea58e38e2..94cc63049bc 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
@@ -15,4 +15,4 @@
    so we vectorize the offset calculation.  This means that the
    64-bit version needs two copies.  */
 /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
-/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index be1139a423c..a14248193ca 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8834,22 +8834,7 @@ vectorizable_store (vec_info *vinfo,
                {
                  if (costing_p)
                    {
-                     /* Only need vector extracting when there are more
-                        than one stores.  */
-                     if (nstores > 1)
-                       inside_cost
-                         += record_stmt_cost (cost_vec, 1, vec_to_scalar,
-                                              stmt_info, slp_node,
-                                              0, vect_body);
-                     /* Take a single lane vector type store as scalar
-                        store to avoid ICE like 110776.  */
-                     if (VECTOR_TYPE_P (ltype)
-                         && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
-                       n_adjacent_stores++;
-                     else
-                       inside_cost
-                         += record_stmt_cost (cost_vec, 1, scalar_store,
-                                              stmt_info, 0, vect_body);
+                     n_adjacent_stores++;
                      continue;
                    }
                  tree newref, newoff;
@@ -8905,9 +8890,26 @@ vectorizable_store (vec_info *vinfo,
       if (costing_p)
        {
          if (n_adjacent_stores > 0)
-           vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
-                                alignment_support_scheme, misalignment,
-                                &inside_cost, cost_vec);
+           {
+             /* Take a single lane vector type store as scalar
+                store to avoid ICE like 110776.  */
+             if (VECTOR_TYPE_P (ltype)
+                 && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
+               vect_get_store_cost (vinfo, stmt_info, slp_node,
+                                    n_adjacent_stores, alignment_support_scheme,
+                                    misalignment, &inside_cost, cost_vec);
+             else
+               inside_cost
+                 += record_stmt_cost (cost_vec, n_adjacent_stores,
+                                      scalar_store, stmt_info, 0, vect_body);
+             /* Only need vector extracting when there are more
+                than one stores.  */
+             if (nstores > 1)
+               inside_cost
+                 += record_stmt_cost (cost_vec, n_adjacent_stores,
+                                      vec_to_scalar, stmt_info, slp_node,
+                                      0, vect_body);
+           }
          if (dump_enabled_p ())
            dump_printf_loc (MSG_NOTE, vect_location,
                             "vect_model_store_cost: inside_cost = %d, "
-- 
2.44.0
> 
>> Thanks, Jennifer
>>> 
>>>> +                   n_adjacent_stores, alignment_support_scheme,
>>>> +                   misalignment, &inside_cost, cost_vec);
>>>> +        }
>>>>   if (dump_enabled_p ())
>>>>     dump_printf_loc (MSG_NOTE, vect_location,
>>>>              "vect_model_store_cost: inside_cost = %d, "
>>>> --
>>>> 2.34.1
>>>>> 
>>>>>> +             inside_cost
>>>>>> +               += record_stmt_cost (cost_vec, n_adjacent_stores, 
>>>>>> vec_to_scalar,
>>>>>> +                                    stmt_info, slp_node,
>>>>>> +                                    0, vect_body);
>>>>>> +           }
>>>>>>      if (dump_enabled_p ())
>>>>>>        dump_printf_loc (MSG_NOTE, vect_location,
>>>>>>                         "vect_model_store_cost: inside_cost = %d, "
>>>>>> --
>>>>>> 2.44.0
>>>>>> 
>>>>>> 
>>>>>>>> 
>>>>>>>> Richard
>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Jennifer
>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Jennifer
>>>>>>>>>>> 
>>>>>>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS 
>>>>>>>>>>> tunable and
>>>>>>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>>>>>>>>> default. To that end, the function aarch64_use_new_vector_costs_p 
>>>>>>>>>>> and its uses
>>>>>>>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as
>>>>>>>>>>> described in
>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>>>>>>>>>> we guarded the call to vect_is_store_elt_extraction in
>>>>>>>>>>> aarch64_vector_costs::add_stmt_cost by count > 1.
>>>>>>>>>>> 
>>>>>>>>>>> Two tests were adjusted due to changes in codegen. In both cases, 
>>>>>>>>>>> the
>>>>>>>>>>> old code performed loop unrolling once, but the new code does not:
>>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>>>>>>>>> -moverride=tune=none):
>>>>>>>>>>> f_int64_t_32:
>>>>>>>>>>> cbz     w3, .L92
>>>>>>>>>>> mov     x4, 0
>>>>>>>>>>> uxtw    x3, w3
>>>>>>>>>>> +       cntd    x5
>>>>>>>>>>> +       whilelo p7.d, xzr, x3
>>>>>>>>>>> +       mov     z29.s, w5
>>>>>>>>>>> mov     z31.s, w2
>>>>>>>>>>> -       whilelo p6.d, xzr, x3
>>>>>>>>>>> -       mov     x2, x3
>>>>>>>>>>> -       index   z30.s, #0, #1
>>>>>>>>>>> -       uqdecd  x2
>>>>>>>>>>> -       ptrue   p5.b, all
>>>>>>>>>>> -       whilelo p7.d, xzr, x2
>>>>>>>>>>> +       index   z30.d, #0, #1
>>>>>>>>>>> +       ptrue   p6.b, all
>>>>>>>>>>> .p2align 3,,7
>>>>>>>>>>> .L94:
>>>>>>>>>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
>>>>>>>>>>> -       ld1d    z28.d, p6/z, [x0]
>>>>>>>>>>> -       movprfx z29, z31
>>>>>>>>>>> -       mul     z29.s, p5/m, z29.s, z30.s
>>>>>>>>>>> -       incw    x4
>>>>>>>>>>> -       uunpklo z0.d, z29.s
>>>>>>>>>>> -       uunpkhi z29.d, z29.s
>>>>>>>>>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
>>>>>>>>>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
>>>>>>>>>>> -       add     z25.d, z28.d, z25.d
>>>>>>>>>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
>>>>>>>>>>> +       movprfx z28, z31
>>>>>>>>>>> +       mul     z28.s, p6/m, z28.s, z30.s
>>>>>>>>>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>>>>>>>>>>> add     z26.d, z27.d, z26.d
>>>>>>>>>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
>>>>>>>>>>> -       whilelo p7.d, x4, x2
>>>>>>>>>>> -       st1d    z25.d, p6, [x0]
>>>>>>>>>>> -       incw    z30.s
>>>>>>>>>>> -       incb    x0, all, mul #2
>>>>>>>>>>> -       whilelo p6.d, x4, x3
>>>>>>>>>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
>>>>>>>>>>> +       add     z30.s, z30.s, z29.s
>>>>>>>>>>> +       incd    x4
>>>>>>>>>>> +       whilelo p7.d, x4, x3
>>>>>>>>>>> b.any   .L94
>>>>>>>>>>> .L92:
>>>>>>>>>>> ret
>>>>>>>>>>> 
>>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>>>>>>>>> -moverride=tune=none):
>>>>>>>>>>> f_int64_t_32:
>>>>>>>>>>> cbz     w3, .L84
>>>>>>>>>>> -       addvl   x5, x1, #1
>>>>>>>>>>> mov     x4, 0
>>>>>>>>>>> uxtw    x3, w3
>>>>>>>>>>> -       mov     z31.s, w2
>>>>>>>>>>> +       cntd    x5
>>>>>>>>>>> whilelo p7.d, xzr, x3
>>>>>>>>>>> -       mov     x2, x3
>>>>>>>>>>> -       index   z30.s, #0, #1
>>>>>>>>>>> -       uqdecd  x2
>>>>>>>>>>> -       ptrue   p5.b, all
>>>>>>>>>>> -       whilelo p6.d, xzr, x2
>>>>>>>>>>> +       mov     z29.s, w5
>>>>>>>>>>> +       mov     z31.s, w2
>>>>>>>>>>> +       index   z30.d, #0, #1
>>>>>>>>>>> +       ptrue   p6.b, all
>>>>>>>>>>> .p2align 3,,7
>>>>>>>>>>> .L86:
>>>>>>>>>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
>>>>>>>>>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
>>>>>>>>>>> -       movprfx z29, z30
>>>>>>>>>>> -       mul     z29.s, p5/m, z29.s, z31.s
>>>>>>>>>>> -       add     z28.d, z28.d, #1
>>>>>>>>>>> -       uunpklo z26.d, z29.s
>>>>>>>>>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
>>>>>>>>>>> -       incw    x4
>>>>>>>>>>> -       uunpkhi z29.d, z29.s
>>>>>>>>>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
>>>>>>>>>>> +       movprfx z28, z30
>>>>>>>>>>> +       mul     z28.s, p6/m, z28.s, z31.s
>>>>>>>>>>> add     z27.d, z27.d, #1
>>>>>>>>>>> -       whilelo p6.d, x4, x2
>>>>>>>>>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
>>>>>>>>>>> -       incw    z30.s
>>>>>>>>>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
>>>>>>>>>>> +       incd    x4
>>>>>>>>>>> +       add     z30.s, z30.s, z29.s
>>>>>>>>>>> whilelo p7.d, x4, x3
>>>>>>>>>>> b.any   .L86
>>>>>>>>>>> .L84:
>>>>>>>>>>> ret
>>>>>>>>>>> 
>>>>>>>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
>>>>>>>>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace 
>>>>>>>>>>> machine and saw
>>>>>>>>>>> no non-noise impact on performance. We would appreciate help with 
>>>>>>>>>>> wider
>>>>>>>>>>> benchmarking on other platforms, if necessary.
>>>>>>>>>>> OK for mainline?
>>>>>>>>>>> 
>>>>>>>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>>>>>>>>> 
>>>>>>>>>>> gcc/
>>>>>>>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove
>>>>>>>>>>> use_new_vector_costs as tuning option.
>>>>>>>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>>>>>>>>>> Remove.
>>>>>>>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of
>>>>>>>>>>> aarch64_use_new_vector_costs_p and guard call to
>>>>>>>>>>> vect_is_store_elt_extraction with count > 1.
>>>>>>>>>>> (aarch64_vector_costs::finish_cost): Remove use of
>>>>>>>>>>> aarch64_use_new_vector_costs_p.
>>>>>>>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove
>>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>>>>>>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>>>>>>>>>> 
>>>>>>>>>>> gcc/testsuite/
>>>>>>>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>>>>>>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>>>>>>>>>> ---
>>>>>>>>>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
>>>>>>>>>>> gcc/config/aarch64/aarch64.cc                 | 22 
>>>>>>>>>>> +++++--------------
>>>>>>>>>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>>>>>>>>>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>>>>>>>>>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>>>>>>>>>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>>>>>>>>>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>>>>>>>>>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>>>>>>>>>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>>>>>>>>>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>>>>>>>>>>> 15 files changed, 7 insertions(+), 32 deletions(-)
>>>>>>>>>>> 
>>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>>>>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>>>> index 5939602576b..ed345b13ed3 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION 
>>>>>>>>>>> ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
>>>>>>>>>>> 
>>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", 
>>>>>>>>>>> CSE_SVE_VL_CONSTANTS)
>>>>>>>>>>> 
>>>>>>>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", 
>>>>>>>>>>> USE_NEW_VECTOR_COSTS)
>>>>>>>>>>> -
>>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
>>>>>>>>>>> MATCHED_VECTOR_THROUGHPUT)
>>>>>>>>>>> 
>>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
>>>>>>>>>>> AVOID_CROSS_LOOP_FMA)
>>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64.cc 
>>>>>>>>>>> b/gcc/config/aarch64/aarch64.cc
>>>>>>>>>>> index 43238aefef2..03806671c97 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>>>>>>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info 
>>>>>>>>>>> *vinfo, bool costing_for_scalar)
>>>>>>>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> -/* Return true if the current CPU should use the new costs defined
>>>>>>>>>>> -   in GCC 11.  This should be removed for GCC 12 and above, with 
>>>>>>>>>>> the
>>>>>>>>>>> -   costs applying to all CPUs instead.  */
>>>>>>>>>>> -static bool
>>>>>>>>>>> -aarch64_use_new_vector_costs_p ()
>>>>>>>>>>> -{
>>>>>>>>>>> -  return (aarch64_tune_params.extra_tuning_flags
>>>>>>>>>>> -       & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>>>>>>>>>> -}
>>>>>>>>>>> -
>>>>>>>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.  
>>>>>>>>>>> */
>>>>>>>>>>> static const simd_vec_cost *
>>>>>>>>>>> aarch64_simd_vec_costs (tree vectype)
>>>>>>>>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int 
>>>>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>>>> 
>>>>>>>>>>> /* Do one-time initialization based on the vinfo.  */
>>>>>>>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>>>>>>>>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>>>>>>>>>> +  if (!m_analyzed_vinfo)
>>>>>>>>>>> {
>>>>>>>>>>> if (loop_vinfo)
>>>>>>>>>>> analyze_loop_vinfo (loop_vinfo);
>>>>>>>>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost (int 
>>>>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>>>> 
>>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>>>>>> of just looking at KIND.  */
>>>>>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>>>>> +  if (stmt_info)
>>>>>>>>>>> {
>>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one
>>>>>>>>>>> vec_to_scalar for each element.  However, we can store the first
>>>>>>>>>>> element using an FP store without a separate extract step.  */
>>>>>>>>>>> -      if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>>>>>> +      if (vect_is_store_elt_extraction (kind, stmt_info) && count 
>>>>>>>>>>> > 1)
>>>>>>>>>>> count -= 1;
>>>>>>>>>>> 
>>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>>>>>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int 
>>>>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>>>> else
>>>>>>>>>>> m_num_last_promote_demote = 0;
>>>>>>>>>>> 
>>>>>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>>>>> +  if (stmt_info)
>>>>>>>>>>> {
>>>>>>>>>>> /* Account for any extra "embedded" costs that apply additively
>>>>>>>>>>> to the base cost calculated above.  */
>>>>>>>>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const 
>>>>>>>>>>> vector_costs *uncast_scalar_costs)
>>>>>>>>>>> 
>>>>>>>>>>> auto *scalar_costs
>>>>>>>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>>>>>>>>>> -  if (loop_vinfo
>>>>>>>>>>> -      && m_vec_flags
>>>>>>>>>>> -      && aarch64_use_new_vector_costs_p ())
>>>>>>>>>>> +  if (loop_vinfo && m_vec_flags)
>>>>>>>>>>> {
>>>>>>>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>>>>>>>>>                                    m_costs[vect_body]);
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>>>> index eb9b89984b0..dafea96e924 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>>>> cortexx925_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags. */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>>>> index 6a098497759..ac001927959 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>>>> @@ -55,7 +55,6 @@ static const struct tune_params 
>>>>>>>>>>> fujitsu_monaka_tunings =
>>>>>>>>>>> 0, /* max_case_values.  */
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>>>> index 9b1cbfc5bd2..7b534831340 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>>>> @@ -183,7 +183,6 @@ static const struct tune_params 
>>>>>>>>>>> generic_armv8_a_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>>>> index 48353a59939..562ef89c67b 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>>>> @@ -249,7 +249,6 @@ static const struct tune_params 
>>>>>>>>>>> generic_armv9_a_tunings =
>>>>>>>>>>> 0, /* max_case_values.  */
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>>>> &generic_armv9a_prefetch_tune,
>>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>>>> index c407b89a22f..fe4f7c10f73 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>>>> @@ -156,7 +156,6 @@ static const struct tune_params 
>>>>>>>>>>> neoverse512tvb_tunings =
>>>>>>>>>>> 0, /* max_case_values.  */
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>>>> index 18199ac206c..56be77423cb 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>>>> neoversen2_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags. */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>>>> index 4da85cfac0d..254ad5e27f8 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>>>> neoversen3_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>>>> index dd9120eee48..c7241cf23d7 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>>>> @@ -227,7 +227,6 @@ static const struct tune_params 
>>>>>>>>>>> neoversev1_tunings =
>>>>>>>>>>> 0, /* max_case_values.  */
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags. */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>>>> index 1369de73991..96f55940649 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>>>> @@ -232,7 +232,6 @@ static const struct tune_params 
>>>>>>>>>>> neoversev2_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),        /* tune_flags.  */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>>>> index d8c82255378..f62ae67d355 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>>>> neoversev3_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags. */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>>>> index 7f050501ede..0233baf5e34 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>>>> neoversev3ae_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags. */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
>>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>>>> index 762805ff54b..c334b7a6875 100644
>>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>>>> @@ -15,4 +15,4 @@
>>>>>>>>>>> so we vectorize the offset calculation.  This means that the
>>>>>>>>>>> 64-bit version needs two copies.  */
>>>>>>>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
>>>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, 
>>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, 
>>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
>>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>>>> index f0ea58e38e2..94cc63049bc 100644
>>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>>>> @@ -15,4 +15,4 @@
>>>>>>>>>>> so we vectorize the offset calculation.  This means that the
>>>>>>>>>>> 64-bit version needs two copies.  */
>>>>>>>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
>>>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Richard Biener <rguent...@suse.de>
>>>>>>>>>> SUSE Software Solutions Germany GmbH,
>>>>>>>>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>>>>>>>>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG 
>>>>>>>>>> Nuernberg)

