On Wed, Dec 18, 2024 at 6:30 PM Jennifer Schmitz <jschm...@nvidia.com> wrote:
>
>
>
> > On 17 Dec 2024, at 18:57, Richard Biener <rguent...@suse.de> wrote:
> >
> >
> >> Am 16.12.2024 um 09:10 schrieb Jennifer Schmitz <jschm...@nvidia.com>:
> >>
> >>
> >>
> >>> On 14 Dec 2024, at 09:32, Richard Biener <rguent...@suse.de> wrote:
> >>>
> >>>
> >>>>> Am 13.12.2024 um 18:00 schrieb Jennifer Schmitz <jschm...@nvidia.com>:
> >>>>
> >>>>
> >>>>
> >>>>> On 13 Dec 2024, at 13:40, Richard Biener <richard.guent...@gmail.com> 
> >>>>> wrote:
> >>>>>
> >>>>>
> >>>>>> On Thu, Dec 12, 2024 at 5:27 PM Jennifer Schmitz <jschm...@nvidia.com> 
> >>>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On 6 Dec 2024, at 08:41, Jennifer Schmitz <jschm...@nvidia.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 5 Dec 2024, at 20:07, Richard Sandiford 
> >>>>>>>> <richard.sandif...@arm.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
> >>>>>>>>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford 
> >>>>>>>>>>>> <richard.sandif...@arm.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
> >>>>>>>>>>>>> [...]
> >>>>>>>>>>>>> Looking at the diff of the vect dumps (below is a section of 
> >>>>>>>>>>>>> the diff for strided_store_2.c), it seemed odd that 
> >>>>>>>>>>>>> vec_to_scalar operations cost 0 now, instead of the previous 
> >>>>>>>>>>>>> cost of 2:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> +strided_store_1.c:38:151: note:    === vectorizable_operation 
> >>>>>>>>>>>>> ===
> >>>>>>>>>>>>> +strided_store_1.c:38:151: note:    vect_model_simple_cost: 
> >>>>>>>>>>>>> inside_cost = 1, prologue_cost = 0 .
> >>>>>>>>>>>>> +strided_store_1.c:38:151: note:   ==> examining statement: *_6 
> >>>>>>>>>>>>> = _7;
> >>>>>>>>>>>>> +strided_store_1.c:38:151: note:   vect_is_simple_use: operand 
> >>>>>>>>>>>>> _3 + 1.0e+0, type of def:    internal
> >>>>>>>>>>>>> +strided_store_1.c:38:151: note:   Vectorizing an unaligned 
> >>>>>>>>>>>>> access.
> >>>>>>>>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128
> >>>>>>>>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
> >>>>>>>>>>>>> +strided_store_1.c:38:151: note:   vect_model_store_cost: 
> >>>>>>>>>>>>> inside_cost = 12, prologue_cost = 0 .
> >>>>>>>>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body
> >>>>>>>>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
> >>>>>>>>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
> >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
> >>>>>>>>>>>>> +<unknown> 1 times vector_load costs 1 in prologue
> >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
> >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
> >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
> >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
> >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
> >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
> >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
> >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
> >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
> >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
> >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in 
> >>>>>>>>>>>>> multiple places in aarch64.cc, the location that causes this 
> >>>>>>>>>>>>> behavior is this one:
> >>>>>>>>>>>>> unsigned
> >>>>>>>>>>>>> aarch64_vector_costs::add_stmt_cost (int count, 
> >>>>>>>>>>>>> vect_cost_for_stmt kind,
> >>>>>>>>>>>>>                             stmt_vec_info stmt_info, slp_tree,
> >>>>>>>>>>>>>                             tree vectype, int misalign,
> >>>>>>>>>>>>>                             vect_cost_model_location where)
> >>>>>>>>>>>>> {
> >>>>>>>>>>>>> [...]
> >>>>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO 
> >>>>>>>>>>>>> instead
> >>>>>>>>>>>>> of just looking at KIND.  */
> >>>>>>>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>>>>>>>>>> +  if (stmt_info)
> >>>>>>>>>>>>> {
> >>>>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one
> >>>>>>>>>>>>> vec_to_scalar for each element.  However, we can store the first
> >>>>>>>>>>>>> element using an FP store without a separate extract step.  */
> >>>>>>>>>>>>> if (vect_is_store_elt_extraction (kind, stmt_info))
> >>>>>>>>>>>>> count -= 1;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
> >>>>>>>>>>>>>                                              stmt_info, 
> >>>>>>>>>>>>> stmt_cost);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> if (vectype && m_vec_flags)
> >>>>>>>>>>>>> stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
> >>>>>>>>>>>>>                                                stmt_info, 
> >>>>>>>>>>>>> vectype,
> >>>>>>>>>>>>>                                                where, 
> >>>>>>>>>>>>> stmt_cost);
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>> [...]
> >>>>>>>>>>>>> return record_stmt_cost (stmt_info, where, (count * 
> >>>>>>>>>>>>> stmt_cost).ceil ());
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Previously, for mtune=generic, this function returned a cost of 
> >>>>>>>>>>>>> 2 for a vec_to_scalar operation in the vect body. Now "if 
> >>>>>>>>>>>>> (stmt_info)" is entered and "if (vect_is_store_elt_extraction 
> >>>>>>>>>>>>> (kind, stmt_info))" evaluates to true, which sets the count to 
> >>>>>>>>>>>>> 0 and leads to a return value of 0.
> >>>>>>>>>>>>
> >>>>>>>>>>>> At the time the code was written, a scalarised store would be 
> >>>>>>>>>>>> costed
> >>>>>>>>>>>> using one vec_to_scalar call into the backend, with the count 
> >>>>>>>>>>>> parameter
> >>>>>>>>>>>> set to the number of elements being stored.  The "count -= 1" was
> >>>>>>>>>>>> supposed to lop off the leading element extraction, since we can 
> >>>>>>>>>>>> store
> >>>>>>>>>>>> lane 0 as a normal FP store.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The target-independent costing was later reworked so that it 
> >>>>>>>>>>>> costs
> >>>>>>>>>>>> each operation individually:
> >>>>>>>>>>>>
> >>>>>>>>>>>>      for (i = 0; i < nstores; i++)
> >>>>>>>>>>>>        {
> >>>>>>>>>>>>          if (costing_p)
> >>>>>>>>>>>>            {
> >>>>>>>>>>>>              /* Only need vector extracting when there are more
> >>>>>>>>>>>>                 than one stores.  */
> >>>>>>>>>>>>              if (nstores > 1)
> >>>>>>>>>>>>                inside_cost
> >>>>>>>>>>>>                  += record_stmt_cost (cost_vec, 1, vec_to_scalar,
> >>>>>>>>>>>>                                       stmt_info, 0, vect_body);
> >>>>>>>>>>>>              /* Take a single lane vector type store as scalar
> >>>>>>>>>>>>                 store to avoid ICE like 110776.  */
> >>>>>>>>>>>>              if (VECTOR_TYPE_P (ltype)
> >>>>>>>>>>>>                  && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> >>>>>>>>>>>>                n_adjacent_stores++;
> >>>>>>>>>>>>              else
> >>>>>>>>>>>>                inside_cost
> >>>>>>>>>>>>                  += record_stmt_cost (cost_vec, 1, scalar_store,
> >>>>>>>>>>>>                                       stmt_info, 0, vect_body);
> >>>>>>>>>>>>              continue;
> >>>>>>>>>>>>            }
> >>>>>>>>>>>>
> >>>>>>>>>>>> Unfortunately, there's no easy way of telling whether a 
> >>>>>>>>>>>> particular call
> >>>>>>>>>>>> is part of a group, and if so, which member of the group it is.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I suppose we could give up on the attempt to be (somewhat) 
> >>>>>>>>>>>> accurate
> >>>>>>>>>>>> and just disable the optimisation.  Or we could restrict it to 
> >>>>>>>>>>>> count > 1,
> >>>>>>>>>>>> since it might still be useful for gathers and scatters.
> >>>>>>>>>>> I tried restricting the calls to vect_is_store_elt_extraction to 
> >>>>>>>>>>> count > 1 and it seems to resolve the issue of costing 
> >>>>>>>>>>> vec_to_scalar operations with 0 (see patch below).
> >>>>>>>>>>> What are your thoughts on this?
> >>>>>>>>>>
> >>>>>>>>>> Why didn't you pursue instead moving the vec_to_scalar cost 
> >>>>>>>>>> together
> >>>>>>>>>> with the n_adjacent_store handling?
> >>>>>>>>> When I continued working on this patch, we had already reached
> >>>>>>>>> stage 3 and I was hesitant to introduce changes to the middle-end
> >>>>>>>>> that were not previously covered by this patch. So I tried to see
> >>>>>>>>> whether the issue could be resolved by making a small change in the backend.
> >>>>>>>>> If you still advise to use the n_adjacent_store instead, I’m happy 
> >>>>>>>>> to look into it again.
> >>>>>>>>
> >>>>>>>> If Richard's ok with adjusting vectorizable_store for GCC 15 (which 
> >>>>>>>> it
> >>>>>>>> sounds like he is), then I agree that would be better.  Otherwise 
> >>>>>>>> we'd
> >>>>>>>> be creating technical debt to clean up for GCC 16.  And it is a 
> >>>>>>>> regression
> >>>>>>>> of sorts, so is stage 3 material from that POV.
> >>>>>>>>
> >>>>>>>> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a
> >>>>>>>> "let's clean this up next stage 1" thing, since we needed to add 
> >>>>>>>> tuning
> >>>>>>>> for a new CPU late during the cycle.  But of course, there were other
> >>>>>>>> priorities when stage 1 actually came around, so it never actually
> >>>>>>>> happened.  Thanks again for being the one to sort this out.)
> >>>>>>> Thanks for your feedback. Then I will try to make it work in 
> >>>>>>> vectorizable_store.
> >>>>>>> Best,
> >>>>>>> Jennifer
> >>>>>> Below is the updated patch with a suggestion for the changes in 
> >>>>>> vectorizable_store. It resolves the issue with the vec_to_scalar 
> >>>>>> operations that were individually costed with 0.
> >>>>>> We already tested it on aarch64, no regression, but we are still doing 
> >>>>>> performance testing.
> >>>>>> Can you give some feedback in the meantime on the patch itself?
> >>>>>> Thanks,
> >>>>>> Jennifer
> >>>>>>
> >>>>>>
> >>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable 
> >>>>>> and
> >>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
> >>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
> >>>>>> default. To that end, the function aarch64_use_new_vector_costs_p and 
> >>>>>> its uses
> >>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as
> >>>>>> described in
> >>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
> >>>>>> we adjusted vectorizable_store such that the variable n_adjacent_stores
> >>>>>> also covers vec_to_scalar operations. This way vec_to_scalar operations
> >>>>>> are not costed individually, but as a group.
> >>>>>>
> >>>>>> Two tests were adjusted due to changes in codegen. In both cases, the
> >>>>>> old code performed loop unrolling once, but the new code does not:
> >>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
> >>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
> >>>>>> -moverride=tune=none):
> >>>>>> f_int64_t_32:
> >>>>>>    cbz     w3, .L92
> >>>>>>    mov     x4, 0
> >>>>>>    uxtw    x3, w3
> >>>>>> +       cntd    x5
> >>>>>> +       whilelo p7.d, xzr, x3
> >>>>>> +       mov     z29.s, w5
> >>>>>>    mov     z31.s, w2
> >>>>>> -       whilelo p6.d, xzr, x3
> >>>>>> -       mov     x2, x3
> >>>>>> -       index   z30.s, #0, #1
> >>>>>> -       uqdecd  x2
> >>>>>> -       ptrue   p5.b, all
> >>>>>> -       whilelo p7.d, xzr, x2
> >>>>>> +       index   z30.d, #0, #1
> >>>>>> +       ptrue   p6.b, all
> >>>>>>    .p2align 3,,7
> >>>>>> .L94:
> >>>>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
> >>>>>> -       ld1d    z28.d, p6/z, [x0]
> >>>>>> -       movprfx z29, z31
> >>>>>> -       mul     z29.s, p5/m, z29.s, z30.s
> >>>>>> -       incw    x4
> >>>>>> -       uunpklo z0.d, z29.s
> >>>>>> -       uunpkhi z29.d, z29.s
> >>>>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
> >>>>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
> >>>>>> -       add     z25.d, z28.d, z25.d
> >>>>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
> >>>>>> +       movprfx z28, z31
> >>>>>> +       mul     z28.s, p6/m, z28.s, z30.s
> >>>>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
> >>>>>>    add     z26.d, z27.d, z26.d
> >>>>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
> >>>>>> -       whilelo p7.d, x4, x2
> >>>>>> -       st1d    z25.d, p6, [x0]
> >>>>>> -       incw    z30.s
> >>>>>> -       incb    x0, all, mul #2
> >>>>>> -       whilelo p6.d, x4, x3
> >>>>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
> >>>>>> +       add     z30.s, z30.s, z29.s
> >>>>>> +       incd    x4
> >>>>>> +       whilelo p7.d, x4, x3
> >>>>>>    b.any   .L94
> >>>>>> .L92:
> >>>>>>    ret
> >>>>>>
> >>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
> >>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
> >>>>>> -moverride=tune=none):
> >>>>>> f_int64_t_32:
> >>>>>>    cbz     w3, .L84
> >>>>>> -       addvl   x5, x1, #1
> >>>>>>    mov     x4, 0
> >>>>>>    uxtw    x3, w3
> >>>>>> -       mov     z31.s, w2
> >>>>>> +       cntd    x5
> >>>>>>    whilelo p7.d, xzr, x3
> >>>>>> -       mov     x2, x3
> >>>>>> -       index   z30.s, #0, #1
> >>>>>> -       uqdecd  x2
> >>>>>> -       ptrue   p5.b, all
> >>>>>> -       whilelo p6.d, xzr, x2
> >>>>>> +       mov     z29.s, w5
> >>>>>> +       mov     z31.s, w2
> >>>>>> +       index   z30.d, #0, #1
> >>>>>> +       ptrue   p6.b, all
> >>>>>>    .p2align 3,,7
> >>>>>> .L86:
> >>>>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
> >>>>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
> >>>>>> -       movprfx z29, z30
> >>>>>> -       mul     z29.s, p5/m, z29.s, z31.s
> >>>>>> -       add     z28.d, z28.d, #1
> >>>>>> -       uunpklo z26.d, z29.s
> >>>>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
> >>>>>> -       incw    x4
> >>>>>> -       uunpkhi z29.d, z29.s
> >>>>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
> >>>>>> +       movprfx z28, z30
> >>>>>> +       mul     z28.s, p6/m, z28.s, z31.s
> >>>>>>    add     z27.d, z27.d, #1
> >>>>>> -       whilelo p6.d, x4, x2
> >>>>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
> >>>>>> -       incw    z30.s
> >>>>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
> >>>>>> +       incd    x4
> >>>>>> +       add     z30.s, z30.s, z29.s
> >>>>>>    whilelo p7.d, x4, x3
> >>>>>>    b.any   .L86
> >>>>>> .L84:
> >>>>>>    ret
> >>>>>>
> >>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
> >>>>>> regression.
> >>>>>> OK for mainline?
> >>>>>>
> >>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
> >>>>>>
> >>>>>> gcc/
> >>>>>>    * tree-vect-stmts.cc (vectorizable_store): Extend the use of
> >>>>>>    n_adjacent_stores to also cover vec_to_scalar operations.
> >>>>>>    * config/aarch64/aarch64-tuning-flags.def: Remove
> >>>>>>    use_new_vector_costs as tuning option.
> >>>>>>    * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
> >>>>>>    Remove.
> >>>>>>    (aarch64_vector_costs::add_stmt_cost): Remove use of
> >>>>>>    aarch64_use_new_vector_costs_p.
> >>>>>>    (aarch64_vector_costs::finish_cost): Remove use of
> >>>>>>    aarch64_use_new_vector_costs_p.
> >>>>>>    * config/aarch64/tuning_models/cortexx925.h: Remove
> >>>>>>    AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
> >>>>>>    * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
> >>>>>>    * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
> >>>>>>    * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
> >>>>>>    * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
> >>>>>>    * config/aarch64/tuning_models/neoversen2.h: Likewise.
> >>>>>>    * config/aarch64/tuning_models/neoversen3.h: Likewise.
> >>>>>>    * config/aarch64/tuning_models/neoversev1.h: Likewise.
> >>>>>>    * config/aarch64/tuning_models/neoversev2.h: Likewise.
> >>>>>>    * config/aarch64/tuning_models/neoversev3.h: Likewise.
> >>>>>>    * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
> >>>>>>
> >>>>>> gcc/testsuite/
> >>>>>>    * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
> >>>>>>    * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
> >>>>>> ---
> >>>>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
> >>>>>> gcc/config/aarch64/aarch64.cc                 | 20 +++----------
> >>>>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
> >>>>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
> >>>>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
> >>>>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
> >>>>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
> >>>>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
> >>>>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
> >>>>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
> >>>>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
> >>>>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
> >>>>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
> >>>>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
> >>>>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
> >>>>>> gcc/tree-vect-stmts.cc                        | 29 ++++++++++---------
> >>>>>> 16 files changed, 22 insertions(+), 44 deletions(-)
> >>>>>>
> >>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
> >>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>> index ffbff20e29c..1de633c739b 100644
> >>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
> >>>>>> CHEAP_SHIFT_EXTEND)
> >>>>>>
> >>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", 
> >>>>>> CSE_SVE_VL_CONSTANTS)
> >>>>>>
> >>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", 
> >>>>>> USE_NEW_VECTOR_COSTS)
> >>>>>> -
> >>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
> >>>>>> MATCHED_VECTOR_THROUGHPUT)
> >>>>>>
> >>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
> >>>>>> AVOID_CROSS_LOOP_FMA)
> >>>>>> diff --git a/gcc/config/aarch64/aarch64.cc 
> >>>>>> b/gcc/config/aarch64/aarch64.cc
> >>>>>> index 77a2a6bfa3a..71fba9cc63b 100644
> >>>>>> --- a/gcc/config/aarch64/aarch64.cc
> >>>>>> +++ b/gcc/config/aarch64/aarch64.cc
> >>>>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info 
> >>>>>> *vinfo, bool costing_for_scalar)
> >>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
> >>>>>> }
> >>>>>>
> >>>>>> -/* Return true if the current CPU should use the new costs defined
> >>>>>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
> >>>>>> -   costs applying to all CPUs instead.  */
> >>>>>> -static bool
> >>>>>> -aarch64_use_new_vector_costs_p ()
> >>>>>> -{
> >>>>>> -  return (aarch64_tune_params.extra_tuning_flags
> >>>>>> -         & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
> >>>>>> -}
> >>>>>> -
> >>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
> >>>>>> static const simd_vec_cost *
> >>>>>> aarch64_simd_vec_costs (tree vectype)
> >>>>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int 
> >>>>>> count, vect_cost_for_stmt kind,
> >>>>>>
> >>>>>> /* Do one-time initialization based on the vinfo.  */
> >>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> >>>>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
> >>>>>> +  if (!m_analyzed_vinfo)
> >>>>>> {
> >>>>>>   if (loop_vinfo)
> >>>>>>    analyze_loop_vinfo (loop_vinfo);
> >>>>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int 
> >>>>>> count, vect_cost_for_stmt kind,
> >>>>>>
> >>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
> >>>>>>  of just looking at KIND.  */
> >>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>>> +  if (stmt_info)
> >>>>>> {
> >>>>>>   /* If we scalarize a strided store, the vectorizer costs one
> >>>>>>     vec_to_scalar for each element.  However, we can store the first
> >>>>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int 
> >>>>>> count, vect_cost_for_stmt kind,
> >>>>>> else
> >>>>>> m_num_last_promote_demote = 0;
> >>>>>>
> >>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>>> +  if (stmt_info)
> >>>>>> {
> >>>>>>   /* Account for any extra "embedded" costs that apply additively
> >>>>>>     to the base cost calculated above.  */
> >>>>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const 
> >>>>>> vector_costs *uncast_scalar_costs)
> >>>>>>
> >>>>>> auto *scalar_costs
> >>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
> >>>>>> -  if (loop_vinfo
> >>>>>> -      && m_vec_flags
> >>>>>> -      && aarch64_use_new_vector_costs_p ())
> >>>>>> +  if (loop_vinfo && m_vec_flags)
> >>>>>> {
> >>>>>>   m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> >>>>>>                                         m_costs[vect_body]);
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
> >>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>> index b2ff716157a..0a8eff69307 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings 
> >>>>>> =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
> >>>>>> &generic_prefetch_tune,
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
> >>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>> index 2d704ecd110..a564528f43d 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>> @@ -55,7 +55,6 @@ static const struct tune_params 
> >>>>>> fujitsu_monaka_tunings =
> >>>>>> 0,   /* max_case_values.  */
> >>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> >>>>>> &generic_prefetch_tune,
> >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
> >>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>> index bdd309ab03d..f090d5cde50 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>> @@ -183,7 +183,6 @@ static const struct tune_params 
> >>>>>> generic_armv8_a_tunings =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> >>>>>> &generic_prefetch_tune,
> >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
> >>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>> index a05a9ab92a2..4c33c147444 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>> @@ -249,7 +249,6 @@ static const struct tune_params 
> >>>>>> generic_armv9_a_tunings =
> >>>>>> 0,   /* max_case_values.  */
> >>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> >>>>>> &generic_armv9a_prefetch_tune,
> >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
> >>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>> index c407b89a22f..fe4f7c10f73 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>> @@ -156,7 +156,6 @@ static const struct tune_params 
> >>>>>> neoverse512tvb_tunings =
> >>>>>> 0,   /* max_case_values.  */
> >>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> >>>>>> &generic_prefetch_tune,
> >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
> >>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>> index fd5f8f37370..0c74068da2c 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings 
> >>>>>> =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
> >>>>>> &generic_prefetch_tune,
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
> >>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>> index 8b156c2fe4d..9d4e1be171a 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings 
> >>>>>> =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
> >>>>>> &generic_prefetch_tune,
> >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
> >>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>> index 23c121d8652..85a78bb2bef 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings 
> >>>>>> =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
> >>>>>> &generic_prefetch_tune,
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
> >>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>> index 40af5f47f4f..1dd452beb8d 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings 
> >>>>>> =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
> >>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),  /* tune_flags.  */
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
> >>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>> index d65d74bfecf..d0ba5b1aef6 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings 
> >>>>>> =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
> >>>>>> &generic_prefetch_tune,
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
> >>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>> index 7b7fa0b4b08..a1572048503 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
> >>>>>> neoversev3ae_tunings =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
> >>>>>> &generic_prefetch_tune,
> >>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
> >>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>> index 762805ff54b..c334b7a6875 100644
> >>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>> @@ -15,4 +15,4 @@
> >>>>>> so we vectorize the offset calculation.  This means that the
> >>>>>> 64-bit version needs two copies.  */
> >>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
> >>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
> >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
> >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
> >>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>> index f0ea58e38e2..94cc63049bc 100644
> >>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>> @@ -15,4 +15,4 @@
> >>>>>> so we vectorize the offset calculation.  This means that the
> >>>>>> 64-bit version needs two copies.  */
> >>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
> >>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
> >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
> >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> >>>>>> index be1139a423c..6d7d28c4702 100644
> >>>>>> --- a/gcc/tree-vect-stmts.cc
> >>>>>> +++ b/gcc/tree-vect-stmts.cc
> >>>>>> @@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo,
> >>>>>>            {
> >>>>>>              if (costing_p)
> >>>>>>                {
> >>>>>> -                     /* Only need vector extracting when there are 
> >>>>>> more
> >>>>>> -                        than one stores.  */
> >>>>>> -                     if (nstores > 1)
> >>>>>> -                       inside_cost
> >>>>>> -                         += record_stmt_cost (cost_vec, 1, 
> >>>>>> vec_to_scalar,
> >>>>>> -                                              stmt_info, slp_node,
> >>>>>> -                                              0, vect_body);
> >>>>>>                  /* Take a single lane vector type store as scalar
> >>>>>>                     store to avoid ICE like 110776.  */
> >>>>>> -                     if (VECTOR_TYPE_P (ltype)
> >>>>>> -                         && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 
> >>>>>> 1U))
> >>>>>> +                     bool single_lane_vec_p =
> >>>>>> +                       VECTOR_TYPE_P (ltype)
> >>>>>> +                       && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U);
> >>>>>> +                     /* Only need vector extracting when there are 
> >>>>>> more
> >>>>>> +                        than one stores.  */
> >>>>>> +                     if (nstores > 1 || single_lane_vec_p)
> >>>>>>                    n_adjacent_stores++;
> >>>>>> -                     else
> >>>>>> +                     if (!single_lane_vec_p)
> >>>>>
> >>>>> I think it's somewhat non-obvious that nstores > 1 and single_lane_vec_p
> >>>>> correlate.  In fact I think that we always record a store; it's just that
> >>>>> for single-element vectors we record scalar stores.  I suggest here to
> >>>>> always just do n_adjacent_stores++
> >>>>> and below ...
> >>>>>
> >>>>>>                    inside_cost
> >>>>>>                      += record_stmt_cost (cost_vec, 1, scalar_store,
> >>>>>>                                           stmt_info, 0, vect_body);
> >>>>>> @@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo,
> >>>>>>   if (costing_p)
> >>>>>>    {
> >>>>>>      if (n_adjacent_stores > 0)
> >>>>>> -           vect_get_store_cost (vinfo, stmt_info, slp_node, 
> >>>>>> n_adjacent_stores,
> >>>>>> -                                alignment_support_scheme, 
> >>>>>> misalignment,
> >>>>>> -                                &inside_cost, cost_vec);
> >>>>>> +           {
> >>>>>> +             vect_get_store_cost (vinfo, stmt_info, slp_node, 
> >>>>>> n_adjacent_stores,
> >>>>>> +                                  alignment_support_scheme, 
> >>>>>> misalignment,
> >>>>>> +                                  &inside_cost, cost_vec);
> >>>>>
> >>>>> ... record n_adjacent_stores scalar_store when ltype is single-lane and 
> >>>>> record
> >>>>> n_adjacent_stores vect_to_scalar if nstores > 1 (and else none).
> >>>>>
> >>>>> Richard.
> >>>> Thanks for the feedback; I’m glad it’s going in the right direction. 
> >>>> Below is the updated patch, re-validated on aarch64.
> >>>> Thanks, Jennifer
> >>>>
> >>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable 
> >>>> and
> >>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
> >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
> >>>> default. To that end, the function aarch64_use_new_vector_costs_p and 
> >>>> its uses
> >>>> were removed. To prevent costing vec_to_scalar operations with 0, as
> >>>> described in
> >>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
> >>>> we adjusted vectorizable_store such that the variable n_adjacent_stores
> >>>> also covers vec_to_scalar operations. This way vec_to_scalar operations
> >>>> are not costed individually, but as a group.
> >>>>
> >>>> Two tests were adjusted due to changes in codegen. In both cases, the
> >>>> old code performed loop unrolling once, but the new code does not:
> >>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
> >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
> >>>> -moverride=tune=none):
> >>>> f_int64_t_32:
> >>>>     cbz     w3, .L92
> >>>>     mov     x4, 0
> >>>>     uxtw    x3, w3
> >>>> +       cntd    x5
> >>>> +       whilelo p7.d, xzr, x3
> >>>> +       mov     z29.s, w5
> >>>>     mov     z31.s, w2
> >>>> -       whilelo p6.d, xzr, x3
> >>>> -       mov     x2, x3
> >>>> -       index   z30.s, #0, #1
> >>>> -       uqdecd  x2
> >>>> -       ptrue   p5.b, all
> >>>> -       whilelo p7.d, xzr, x2
> >>>> +       index   z30.d, #0, #1
> >>>> +       ptrue   p6.b, all
> >>>>     .p2align 3,,7
> >>>> .L94:
> >>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
> >>>> -       ld1d    z28.d, p6/z, [x0]
> >>>> -       movprfx z29, z31
> >>>> -       mul     z29.s, p5/m, z29.s, z30.s
> >>>> -       incw    x4
> >>>> -       uunpklo z0.d, z29.s
> >>>> -       uunpkhi z29.d, z29.s
> >>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
> >>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
> >>>> -       add     z25.d, z28.d, z25.d
> >>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
> >>>> +       movprfx z28, z31
> >>>> +       mul     z28.s, p6/m, z28.s, z30.s
> >>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
> >>>>     add     z26.d, z27.d, z26.d
> >>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
> >>>> -       whilelo p7.d, x4, x2
> >>>> -       st1d    z25.d, p6, [x0]
> >>>> -       incw    z30.s
> >>>> -       incb    x0, all, mul #2
> >>>> -       whilelo p6.d, x4, x3
> >>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
> >>>> +       add     z30.s, z30.s, z29.s
> >>>> +       incd    x4
> >>>> +       whilelo p7.d, x4, x3
> >>>>     b.any   .L94
> >>>> .L92:
> >>>>     ret
> >>>>
> >>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
> >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
> >>>> -moverride=tune=none):
> >>>> f_int64_t_32:
> >>>>     cbz     w3, .L84
> >>>> -       addvl   x5, x1, #1
> >>>>     mov     x4, 0
> >>>>     uxtw    x3, w3
> >>>> -       mov     z31.s, w2
> >>>> +       cntd    x5
> >>>>     whilelo p7.d, xzr, x3
> >>>> -       mov     x2, x3
> >>>> -       index   z30.s, #0, #1
> >>>> -       uqdecd  x2
> >>>> -       ptrue   p5.b, all
> >>>> -       whilelo p6.d, xzr, x2
> >>>> +       mov     z29.s, w5
> >>>> +       mov     z31.s, w2
> >>>> +       index   z30.d, #0, #1
> >>>> +       ptrue   p6.b, all
> >>>>     .p2align 3,,7
> >>>> .L86:
> >>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
> >>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
> >>>> -       movprfx z29, z30
> >>>> -       mul     z29.s, p5/m, z29.s, z31.s
> >>>> -       add     z28.d, z28.d, #1
> >>>> -       uunpklo z26.d, z29.s
> >>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
> >>>> -       incw    x4
> >>>> -       uunpkhi z29.d, z29.s
> >>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
> >>>> +       movprfx z28, z30
> >>>> +       mul     z28.s, p6/m, z28.s, z31.s
> >>>>     add     z27.d, z27.d, #1
> >>>> -       whilelo p6.d, x4, x2
> >>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
> >>>> -       incw    z30.s
> >>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
> >>>> +       incd    x4
> >>>> +       add     z30.s, z30.s, z29.s
> >>>>     whilelo p7.d, x4, x3
> >>>>     b.any   .L86
> >>>> .L84:
> >>>> ret
> >>>>
> >>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
> >>>> regression.
> >>>> OK for mainline?
> >>>>
> >>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
> >>>>
> >>>> gcc/
> >>>> * tree-vect-stmts.cc (vectorizable_store): Extend the use of
> >>>> n_adjacent_stores to also cover vec_to_scalar operations.
> >>>> * config/aarch64/aarch64-tuning-flags.def: Remove
> >>>> use_new_vector_costs as tuning option.
> >>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
> >>>> Remove.
> >>>> (aarch64_vector_costs::add_stmt_cost): Remove use of
> >>>> aarch64_use_new_vector_costs_p.
> >>>> (aarch64_vector_costs::finish_cost): Remove use of
> >>>> aarch64_use_new_vector_costs_p.
> >>>> * config/aarch64/tuning_models/cortexx925.h: Remove
> >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
> >>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
> >>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
> >>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoversen2.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoversen3.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoversev1.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoversev2.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoversev3.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
> >>>>
> >>>> gcc/testsuite/
> >>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
> >>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
> >>>> ---
> >>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 -
> >>>> gcc/config/aarch64/aarch64.cc                 | 20 ++--------
> >>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
> >>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
> >>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
> >>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
> >>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
> >>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
> >>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
> >>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
> >>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
> >>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
> >>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
> >>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
> >>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
> >>>> gcc/tree-vect-stmts.cc                        | 37 +++++++++++--------
> >>>> 16 files changed, 27 insertions(+), 47 deletions(-)
> >>>>
> >>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
> >>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>> index ffbff20e29c..1de633c739b 100644
> >>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
> >>>> CHEAP_SHIFT_EXTEND)
> >>>>
> >>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", 
> >>>> CSE_SVE_VL_CONSTANTS)
> >>>>
> >>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", 
> >>>> USE_NEW_VECTOR_COSTS)
> >>>> -
> >>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
> >>>> MATCHED_VECTOR_THROUGHPUT)
> >>>>
> >>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
> >>>> AVOID_CROSS_LOOP_FMA)
> >>>> diff --git a/gcc/config/aarch64/aarch64.cc 
> >>>> b/gcc/config/aarch64/aarch64.cc
> >>>> index 77a2a6bfa3a..71fba9cc63b 100644
> >>>> --- a/gcc/config/aarch64/aarch64.cc
> >>>> +++ b/gcc/config/aarch64/aarch64.cc
> >>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info 
> >>>> *vinfo, bool costing_for_scalar)
> >>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
> >>>> }
> >>>>
> >>>> -/* Return true if the current CPU should use the new costs defined
> >>>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
> >>>> -   costs applying to all CPUs instead.  */
> >>>> -static bool
> >>>> -aarch64_use_new_vector_costs_p ()
> >>>> -{
> >>>> -  return (aarch64_tune_params.extra_tuning_flags
> >>>> -      & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
> >>>> -}
> >>>> -
> >>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */
> >>>> static const simd_vec_cost *
> >>>> aarch64_simd_vec_costs (tree vectype)
> >>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> >>>> vect_cost_for_stmt kind,
> >>>>
> >>>> /* Do one-time initialization based on the vinfo.  */
> >>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> >>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
> >>>> +  if (!m_analyzed_vinfo)
> >>>>  {
> >>>>    if (loop_vinfo)
> >>>> analyze_loop_vinfo (loop_vinfo);
> >>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> >>>> vect_cost_for_stmt kind,
> >>>>
> >>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
> >>>>   of just looking at KIND.  */
> >>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>> +  if (stmt_info)
> >>>>  {
> >>>>    /* If we scalarize a strided store, the vectorizer costs one
> >>>>  vec_to_scalar for each element.  However, we can store the first
> >>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> >>>> vect_cost_for_stmt kind,
> >>>> else
> >>>>  m_num_last_promote_demote = 0;
> >>>>
> >>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>> +  if (stmt_info)
> >>>>  {
> >>>>    /* Account for any extra "embedded" costs that apply additively
> >>>>  to the base cost calculated above.  */
> >>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const 
> >>>> vector_costs *uncast_scalar_costs)
> >>>>
> >>>> auto *scalar_costs
> >>>>  = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
> >>>> -  if (loop_vinfo
> >>>> -      && m_vec_flags
> >>>> -      && aarch64_use_new_vector_costs_p ())
> >>>> +  if (loop_vinfo && m_vec_flags)
> >>>>  {
> >>>>    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> >>>>                      m_costs[vect_body]);
> >>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
> >>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>> index 5ebaf66e986..74772f3e15f 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>> @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
> >>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>> index 2d704ecd110..a564528f43d 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings 
> >>>> =
> >>>> 0,    /* max_case_values.  */
> >>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags. */
> >>>> &generic_prefetch_tune,
> >>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
> >>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>> index bdd309ab03d..f090d5cde50 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>> @@ -183,7 +183,6 @@ static const struct tune_params 
> >>>> generic_armv8_a_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags. */
> >>>> &generic_prefetch_tune,
> >>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
> >>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>> index 785e00946bc..7b5821183bc 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>> @@ -251,7 +251,6 @@ static const struct tune_params 
> >>>> generic_armv9_a_tunings =
> >>>> 0,    /* max_case_values.  */
> >>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags. */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
> >>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>> index 007f987154c..f7457df59e5 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>> @@ -156,7 +156,6 @@ static const struct tune_params 
> >>>> neoverse512tvb_tunings =
> >>>> 0,    /* max_case_values.  */
> >>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags. */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
> >>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>> index 32560d2f5f8..541b61c8179 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
> >>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>> index 2010bc4645b..eff668132a8 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags. */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
> >>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>> index c3751e32696..d11472b6e1e 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
> >>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>> index 80dbe5c806c..ee77ffdd3bc 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
> >>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),    /* tune_flags.  */
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
> >>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>> index efe09e16d1e..6ef143ef7d5 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
> >>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>> index 66849f30889..96bdbf971f1 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings 
> >>>> =
> >>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
> >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>> index 762805ff54b..c334b7a6875 100644
> >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>> @@ -15,4 +15,4 @@
> >>>> so we vectorize the offset calculation.  This means that the
> >>>> 64-bit version needs two copies.  */
> >>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
> >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
> >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
> >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
> >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>> index f0ea58e38e2..94cc63049bc 100644
> >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>> @@ -15,4 +15,4 @@
> >>>> so we vectorize the offset calculation.  This means that the
> >>>> 64-bit version needs two copies.  */
> >>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
> >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
> >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
> >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> >>>> index be1139a423c..ab57163c243 100644
> >>>> --- a/gcc/tree-vect-stmts.cc
> >>>> +++ b/gcc/tree-vect-stmts.cc
> >>>> @@ -8834,19 +8834,8 @@ vectorizable_store (vec_info *vinfo,
> >>>>     {
> >>>>       if (costing_p)
> >>>>         {
> >>>> -              /* Only need vector extracting when there are more
> >>>> -             than one stores.  */
> >>>> -              if (nstores > 1)
> >>>> -            inside_cost
> >>>> -              += record_stmt_cost (cost_vec, 1, vec_to_scalar,
> >>>> -                           stmt_info, slp_node,
> >>>> -                           0, vect_body);
> >>>> -              /* Take a single lane vector type store as scalar
> >>>> -             store to avoid ICE like 110776.  */
> >>>> -              if (VECTOR_TYPE_P (ltype)
> >>>> -              && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> >>>> -            n_adjacent_stores++;
> >>>> -              else
> >>>> +              n_adjacent_stores++;
> >>>> +              if (!VECTOR_TYPE_P (ltype))
> >>>
> >>> This should be combined with the single-lane vector case below
> >>>
> >>>>         inside_cost
> >>>>           += record_stmt_cost (cost_vec, 1, scalar_store,
> >>>>                        stmt_info, 0, vect_body);
> >>>> @@ -8905,9 +8894,25 @@ vectorizable_store (vec_info *vinfo,
> >>>>    if (costing_p)
> >>>> {
> >>>>   if (n_adjacent_stores > 0)
> >>>> -        vect_get_store_cost (vinfo, stmt_info, slp_node, 
> >>>> n_adjacent_stores,
> >>>> -                 alignment_support_scheme, misalignment,
> >>>> -                 &inside_cost, cost_vec);
> >>>> +        {
> >>>> +          /* Take a single lane vector type store as scalar
> >>>> +         store to avoid ICE like 110776.  */
> >>>> +          if (VECTOR_TYPE_P (ltype)
> >>>> +          && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> >>>> +        inside_cost
> >>>> +          += record_stmt_cost (cost_vec, n_adjacent_stores,
> >>>> +                       scalar_store, stmt_info, 0, vect_body);
> >>>> +          /* Only need vector extracting when there are more
> >>>> +         than one stores.  */
> >>>> +          if (nstores > 1)
> >>>> +        inside_cost
> >>>> +          += record_stmt_cost (cost_vec, n_adjacent_stores,
> >>>> +                       vec_to_scalar, stmt_info, slp_node,
> >>>> +                       0, vect_body);
> >>>> +          vect_get_store_cost (vinfo, stmt_info, slp_node,
> >>>
> >>> This should be only done for multi-lane vectors.
> >> Thanks for the quick reply. As I am making the changes, I am wondering: Do 
> >> we even need n_adjacent_stores anymore? It appears to always have the same 
> >> value as nstores. Can we remove it and use nstores instead, or does it 
> >> still serve another purpose?
> >
> > It was a heuristic needed for powerpc(?). Can you confirm we’re not 
> > combining stores from VF unrolling for strided SLP stores?
> Hi Richard,
> the reasoning behind my suggestion to replace n_adjacent_stores by nstores in 
> this code section is that with my patch they will logically always have the 
> same value.
>
> Having said that, I looked into why n_adjacent_stores was introduced in the 
> first place: The patch [1] that introduced n_adjacent_stores fixed a 
> regression on aarch64 by costing vector loads/stores together. The variables 
> n_adjacent_stores and n_adjacent_loads were added in two code sections each 
> in vectorizable_store and vectorizable_load. The connection to PowerPC you 
> recalled is also mentioned in the PR, but I believe it refers to the enum 
> dr_alignment_support alignment_support_scheme that is used in
>
> vect_get_store_cost (vinfo, stmt_info, slp_node,
>                      n_adjacent_stores, alignment_support_scheme,
>                      misalignment, &inside_cost, cost_vec);
>
> to which I made no changes other than refactoring the if-statement around it.
>
> So, taking into account that n_adjacent_stores was introduced in multiple 
> locations, I would actually keep n_adjacent_stores in the code section that 
> I changed, in order to keep vectorizable_store and vectorizable_load 
> consistent.
>
> Regarding your question about not combining stores from loop unrolling for 
> strided SLP stores: I'm not entirely sure what you mean, but could it be 
> covered by the tests gcc.target/aarch64/ldp_stp_* that were also mentioned in 
> [1]?

I'm referring to a case with variable stride:

 for (.. i += s)
   {
      a[4*i] = ..;
      a[4*i + 1] = ...;
      a[4*i + 2] = ...;
      a[4*i + 3] = ...;
   }

where we might choose to store to the V4SI destination using two
(adjacent) V2SI stores.  If the VF ends up equal to two, we'd have two
sets of a[] stores, thus four V2SI stores, but only two of them would be
"adjacent".  Note I don't know whether "adjacent" was really supposed
to mean adjacent or rather "related".
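
A rough C sketch of that situation (using GCC vector extensions; the
function and variable names are purely illustrative, not taken from the
patch): the two V2SI pieces within one group are adjacent in memory, but
the two groups are not.

 /* Illustrative only: one V4SI group emitted as two V2SI stores.  */
 typedef int v2si __attribute__ ((vector_size (8)));

 static void
 store_one_group (int *dst, v2si lo, v2si hi)
 {
   /* Two V2SI stores covering one V4SI group: dst[0..1] and dst[2..3].  */
   __builtin_memcpy (dst, &lo, sizeof lo);
   __builtin_memcpy (dst + 2, &hi, sizeof hi);
 }

 void
 f (int *a, long i, long s, v2si lo0, v2si hi0, v2si lo1, v2si hi1)
 {
   store_one_group (a + 4 * i, lo0, hi0);          /* group at a[4*i]  */
   store_one_group (a + 4 * (i + s), lo1, hi1);    /* group at a[4*(i+s)]  */
 }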

Anyway, the costing interface for loads and stores is likely to change
substantially for GCC 16.

> I added the changes you proposed in the updated patch below, but kept 
> n_adjacent_stores. The patch was re-validated on aarch64.
> Thanks,
> Jennifer
>
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111784#c3
>
>
> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
> default. To that end, the function aarch64_use_new_vector_costs_p and its uses
> were removed. To prevent costing vec_to_scalar operations with 0, as
> described in
> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
> we adjusted vectorizable_store such that the variable n_adjacent_stores
> also covers vec_to_scalar operations. This way vec_to_scalar operations
> are not costed individually, but as a group.
>
> Two tests were adjusted due to changes in codegen. In both cases, the
> old code performed loop unrolling once, but the new code does not:
> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
> -moverride=tune=none):
> f_int64_t_32:
>         cbz     w3, .L92
>         mov     x4, 0
>         uxtw    x3, w3
> +       cntd    x5
> +       whilelo p7.d, xzr, x3
> +       mov     z29.s, w5
>         mov     z31.s, w2
> -       whilelo p6.d, xzr, x3
> -       mov     x2, x3
> -       index   z30.s, #0, #1
> -       uqdecd  x2
> -       ptrue   p5.b, all
> -       whilelo p7.d, xzr, x2
> +       index   z30.d, #0, #1
> +       ptrue   p6.b, all
>         .p2align 3,,7
>  .L94:
> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
> -       ld1d    z28.d, p6/z, [x0]
> -       movprfx z29, z31
> -       mul     z29.s, p5/m, z29.s, z30.s
> -       incw    x4
> -       uunpklo z0.d, z29.s
> -       uunpkhi z29.d, z29.s
> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
> -       add     z25.d, z28.d, z25.d
> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
> +       movprfx z28, z31
> +       mul     z28.s, p6/m, z28.s, z30.s
> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>         add     z26.d, z27.d, z26.d
> -       st1d    z26.d, p7, [x0, #1, mul vl]
> -       whilelo p7.d, x4, x2
> -       st1d    z25.d, p6, [x0]
> -       incw    z30.s
> -       incb    x0, all, mul #2
> -       whilelo p6.d, x4, x3
> +       st1d    z26.d, p7, [x0, x4, lsl 3]
> +       add     z30.s, z30.s, z29.s
> +       incd    x4
> +       whilelo p7.d, x4, x3
>         b.any   .L94
>  .L92:
>         ret
>
> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
> -moverride=tune=none):
> f_int64_t_32:
>         cbz     w3, .L84
> -       addvl   x5, x1, #1
>         mov     x4, 0
>         uxtw    x3, w3
> -       mov     z31.s, w2
> +       cntd    x5
>         whilelo p7.d, xzr, x3
> -       mov     x2, x3
> -       index   z30.s, #0, #1
> -       uqdecd  x2
> -       ptrue   p5.b, all
> -       whilelo p6.d, xzr, x2
> +       mov     z29.s, w5
> +       mov     z31.s, w2
> +       index   z30.d, #0, #1
> +       ptrue   p6.b, all
>         .p2align 3,,7
>  .L86:
> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
> -       movprfx z29, z30
> -       mul     z29.s, p5/m, z29.s, z31.s
> -       add     z28.d, z28.d, #1
> -       uunpklo z26.d, z29.s
> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
> -       incw    x4
> -       uunpkhi z29.d, z29.s
> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
> +       movprfx z28, z30
> +       mul     z28.s, p6/m, z28.s, z31.s
>         add     z27.d, z27.d, #1
> -       whilelo p6.d, x4, x2
> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
> -       incw    z30.s
> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
> +       incd    x4
> +       add     z30.s, z30.s, z29.s
>         whilelo p7.d, x4, x3
>         b.any   .L86
>  .L84:
>         ret
>
> The patch was bootstrapped and tested on aarch64-linux-gnu, no
> regression.
> OK for mainline?

LGTM.

Richard.

> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>
> gcc/
>         * tree-vect-stmts.cc (vectorizable_store): Extend the use of
>         n_adjacent_stores to also cover vec_to_scalar operations.
>         * config/aarch64/aarch64-tuning-flags.def: Remove
>         use_new_vector_costs as tuning option.
>         * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>         Remove.
>         (aarch64_vector_costs::add_stmt_cost): Remove use of
>         aarch64_use_new_vector_costs_p.
>         (aarch64_vector_costs::finish_cost): Remove use of
>         aarch64_use_new_vector_costs_p.
>         * config/aarch64/tuning_models/cortexx925.h: Remove
>         AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>         * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>         * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>         * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>         * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>         * config/aarch64/tuning_models/neoversen2.h: Likewise.
>         * config/aarch64/tuning_models/neoversen3.h: Likewise.
>         * config/aarch64/tuning_models/neoversev1.h: Likewise.
>         * config/aarch64/tuning_models/neoversev2.h: Likewise.
>         * config/aarch64/tuning_models/neoversev3.h: Likewise.
>         * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>
> gcc/testsuite/
>         * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>         * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
> ---
>  gcc/config/aarch64/aarch64-tuning-flags.def   |  2 -
>  gcc/config/aarch64/aarch64.cc                 | 20 ++--------
>  gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>  .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>  .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>  .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>  .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>  gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>  gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>  gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>  gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>  gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>  .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>  .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>  .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>  gcc/tree-vect-stmts.cc                        | 40 ++++++++++---------
>  16 files changed, 27 insertions(+), 50 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> index ffbff20e29c..1de633c739b 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
> CHEAP_SHIFT_EXTEND)
>
>  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>
> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
> -
>  AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
> MATCHED_VECTOR_THROUGHPUT)
>
>  AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 77a2a6bfa3a..71fba9cc63b 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, 
> bool costing_for_scalar)
>    return new aarch64_vector_costs (vinfo, costing_for_scalar);
>  }
>
> -/* Return true if the current CPU should use the new costs defined
> -   in GCC 11.  This should be removed for GCC 12 and above, with the
> -   costs applying to all CPUs instead.  */
> -static bool
> -aarch64_use_new_vector_costs_p ()
> -{
> -  return (aarch64_tune_params.extra_tuning_flags
> -         & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
> -}
> -
>  /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
>  static const simd_vec_cost *
>  aarch64_simd_vec_costs (tree vectype)
> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
>
>    /* Do one-time initialization based on the vinfo.  */
>    loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
> +  if (!m_analyzed_vinfo)
>      {
>        if (loop_vinfo)
>         analyze_loop_vinfo (loop_vinfo);
> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
>
>    /* Try to get a more accurate cost by looking at STMT_INFO instead
>       of just looking at KIND.  */
> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> +  if (stmt_info)
>      {
>        /* If we scalarize a strided store, the vectorizer costs one
>          vec_to_scalar for each element.  However, we can store the first
> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
>    else
>      m_num_last_promote_demote = 0;
>
> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> +  if (stmt_info)
>      {
>        /* Account for any extra "embedded" costs that apply additively
>          to the base cost calculated above.  */
> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const vector_costs 
> *uncast_scalar_costs)
>
>    auto *scalar_costs
>      = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
> -  if (loop_vinfo
> -      && m_vec_flags
> -      && aarch64_use_new_vector_costs_p ())
> +  if (loop_vinfo && m_vec_flags)
>      {
>        m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>                                              m_costs[vect_body]);
> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
> b/gcc/config/aarch64/tuning_models/cortexx925.h
> index 5ebaf66e986..74772f3e15f 100644
> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>    &generic_armv9a_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> index 2d704ecd110..a564528f43d 100644
> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>    &generic_prefetch_tune,
>    AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> index bdd309ab03d..f090d5cde50 100644
> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> @@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>    &generic_prefetch_tune,
>    AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> index 785e00946bc..7b5821183bc 100644
> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> @@ -251,7 +251,6 @@ static const struct tune_params generic_armv9_a_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>    &generic_armv9a_prefetch_tune,
>    AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> index 007f987154c..f7457df59e5 100644
> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>    &generic_armv9a_prefetch_tune,
>    AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
> b/gcc/config/aarch64/tuning_models/neoversen2.h
> index 32560d2f5f8..541b61c8179 100644
> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>    &generic_armv9a_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
> b/gcc/config/aarch64/tuning_models/neoversen3.h
> index 2010bc4645b..eff668132a8 100644
> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>    &generic_armv9a_prefetch_tune,
>    AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
> b/gcc/config/aarch64/tuning_models/neoversev1.h
> index c3751e32696..d11472b6e1e 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>    &generic_armv9a_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
> b/gcc/config/aarch64/tuning_models/neoversev2.h
> index 80dbe5c806c..ee77ffdd3bc 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>     | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),  /* tune_flags.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
> b/gcc/config/aarch64/tuning_models/neoversev3.h
> index efe09e16d1e..6ef143ef7d5 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>    &generic_armv9a_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> index 66849f30889..96bdbf971f1 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>    &generic_armv9a_prefetch_tune,
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> index 762805ff54b..c334b7a6875 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> @@ -15,4 +15,4 @@
>     so we vectorize the offset calculation.  This means that the
>     64-bit version needs two copies.  */
>  /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> index f0ea58e38e2..94cc63049bc 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> @@ -15,4 +15,4 @@
>     so we vectorize the offset calculation.  This means that the
>     64-bit version needs two copies.  */
>  /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+, 
> z[0-9]+.s, uxtw 2\]\n} 3 } } */
> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, 
> z[0-9]+.d, lsl 3\]\n} 15 } } */
> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, 
> z[0-9]+.d, lsl 3\]\n} 9 } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index be1139a423c..a14248193ca 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -8834,22 +8834,7 @@ vectorizable_store (vec_info *vinfo,
>                 {
>                   if (costing_p)
>                     {
> -                     /* Only need vector extracting when there are more
> -                        than one stores.  */
> -                     if (nstores > 1)
> -                       inside_cost
> -                         += record_stmt_cost (cost_vec, 1, vec_to_scalar,
> -                                              stmt_info, slp_node,
> -                                              0, vect_body);
> -                     /* Take a single lane vector type store as scalar
> -                        store to avoid ICE like 110776.  */
> -                     if (VECTOR_TYPE_P (ltype)
> -                         && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> -                       n_adjacent_stores++;
> -                     else
> -                       inside_cost
> -                         += record_stmt_cost (cost_vec, 1, scalar_store,
> -                                              stmt_info, 0, vect_body);
> +                     n_adjacent_stores++;
>                       continue;
>                     }
>                   tree newref, newoff;
> @@ -8905,9 +8890,26 @@ vectorizable_store (vec_info *vinfo,
>        if (costing_p)
>         {
>           if (n_adjacent_stores > 0)
> -           vect_get_store_cost (vinfo, stmt_info, slp_node, 
> n_adjacent_stores,
> -                                alignment_support_scheme, misalignment,
> -                                &inside_cost, cost_vec);
> +           {
> +             /* Take a single lane vector type store as scalar
> +                store to avoid ICE like 110776.  */
> +             if (VECTOR_TYPE_P (ltype)
> +                 && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> +               vect_get_store_cost (vinfo, stmt_info, slp_node,
> +                                    n_adjacent_stores, 
> alignment_support_scheme,
> +                                    misalignment, &inside_cost, cost_vec);
> +             else
> +               inside_cost
> +                 += record_stmt_cost (cost_vec, n_adjacent_stores,
> +                                      scalar_store, stmt_info, 0, vect_body);
> +             /* Only need vector extracting when there are more
> +                than one stores.  */
> +             if (nstores > 1)
> +               inside_cost
> +                 += record_stmt_cost (cost_vec, n_adjacent_stores,
> +                                      vec_to_scalar, stmt_info, slp_node,
> +                                      0, vect_body);
> +           }
>           if (dump_enabled_p ())
>             dump_printf_loc (MSG_NOTE, vect_location,
>                              "vect_model_store_cost: inside_cost = %d, "
> --
> 2.44.0
> >
> >> Thanks, Jennifer
> >>>
> >>>> +                   n_adjacent_stores, alignment_support_scheme,
> >>>> +                   misalignment, &inside_cost, cost_vec);
> >>>> +        }
> >>>>   if (dump_enabled_p ())
> >>>>     dump_printf_loc (MSG_NOTE, vect_location,
> >>>>              "vect_model_store_cost: inside_cost = %d, "
> >>>> --
> >>>> 2.34.1
> >>>>>
> >>>>>> +             inside_cost
> >>>>>> +               += record_stmt_cost (cost_vec, n_adjacent_stores, 
> >>>>>> vec_to_scalar,
> >>>>>> +                                    stmt_info, slp_node,
> >>>>>> +                                    0, vect_body);
> >>>>>> +           }
> >>>>>>      if (dump_enabled_p ())
> >>>>>>        dump_printf_loc (MSG_NOTE, vect_location,
> >>>>>>                         "vect_model_store_cost: inside_cost = %d, "
> >>>>>> --
> >>>>>> 2.44.0
> >>>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>> Richard
> >>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Jennifer
> >>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Jennifer
> >>>>>>>>>>>
> >>>>>>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS 
> >>>>>>>>>>> tunable and
> >>>>>>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes 
> >>>>>>>>>>> the
> >>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
> >>>>>>>>>>> default. To that end, the function aarch64_use_new_vector_costs_p 
> >>>>>>>>>>> and its uses
> >>>>>>>>>>> were removed. To prevent costing vec_to_scalar operations with 0, 
> >>>>>>>>>>> as
> >>>>>>>>>>> described in
> >>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
> >>>>>>>>>>> we guarded the call to vect_is_store_elt_extraction in
> >>>>>>>>>>> aarch64_vector_costs::add_stmt_cost by count > 1.
> >>>>>>>>>>>
> >>>>>>>>>>> Two tests were adjusted due to changes in codegen. In both cases, 
> >>>>>>>>>>> the
> >>>>>>>>>>> old code performed loop unrolling once, but the new code does not:
> >>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled 
> >>>>>>>>>>> with
> >>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
> >>>>>>>>>>> -moverride=tune=none):
> >>>>>>>>>>> f_int64_t_32:
> >>>>>>>>>>> cbz     w3, .L92
> >>>>>>>>>>> mov     x4, 0
> >>>>>>>>>>> uxtw    x3, w3
> >>>>>>>>>>> +       cntd    x5
> >>>>>>>>>>> +       whilelo p7.d, xzr, x3
> >>>>>>>>>>> +       mov     z29.s, w5
> >>>>>>>>>>> mov     z31.s, w2
> >>>>>>>>>>> -       whilelo p6.d, xzr, x3
> >>>>>>>>>>> -       mov     x2, x3
> >>>>>>>>>>> -       index   z30.s, #0, #1
> >>>>>>>>>>> -       uqdecd  x2
> >>>>>>>>>>> -       ptrue   p5.b, all
> >>>>>>>>>>> -       whilelo p7.d, xzr, x2
> >>>>>>>>>>> +       index   z30.d, #0, #1
> >>>>>>>>>>> +       ptrue   p6.b, all
> >>>>>>>>>>> .p2align 3,,7
> >>>>>>>>>>> .L94:
> >>>>>>>>>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
> >>>>>>>>>>> -       ld1d    z28.d, p6/z, [x0]
> >>>>>>>>>>> -       movprfx z29, z31
> >>>>>>>>>>> -       mul     z29.s, p5/m, z29.s, z30.s
> >>>>>>>>>>> -       incw    x4
> >>>>>>>>>>> -       uunpklo z0.d, z29.s
> >>>>>>>>>>> -       uunpkhi z29.d, z29.s
> >>>>>>>>>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
> >>>>>>>>>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
> >>>>>>>>>>> -       add     z25.d, z28.d, z25.d
> >>>>>>>>>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
> >>>>>>>>>>> +       movprfx z28, z31
> >>>>>>>>>>> +       mul     z28.s, p6/m, z28.s, z30.s
> >>>>>>>>>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
> >>>>>>>>>>> add     z26.d, z27.d, z26.d
> >>>>>>>>>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
> >>>>>>>>>>> -       whilelo p7.d, x4, x2
> >>>>>>>>>>> -       st1d    z25.d, p6, [x0]
> >>>>>>>>>>> -       incw    z30.s
> >>>>>>>>>>> -       incb    x0, all, mul #2
> >>>>>>>>>>> -       whilelo p6.d, x4, x3
> >>>>>>>>>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
> >>>>>>>>>>> +       add     z30.s, z30.s, z29.s
> >>>>>>>>>>> +       incd    x4
> >>>>>>>>>>> +       whilelo p7.d, x4, x3
> >>>>>>>>>>> b.any   .L94
> >>>>>>>>>>> .L92:
> >>>>>>>>>>> ret
> >>>>>>>>>>>
> >>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled 
> >>>>>>>>>>> with
> >>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
> >>>>>>>>>>> -moverride=tune=none):
> >>>>>>>>>>> f_int64_t_32:
> >>>>>>>>>>> cbz     w3, .L84
> >>>>>>>>>>> -       addvl   x5, x1, #1
> >>>>>>>>>>> mov     x4, 0
> >>>>>>>>>>> uxtw    x3, w3
> >>>>>>>>>>> -       mov     z31.s, w2
> >>>>>>>>>>> +       cntd    x5
> >>>>>>>>>>> whilelo p7.d, xzr, x3
> >>>>>>>>>>> -       mov     x2, x3
> >>>>>>>>>>> -       index   z30.s, #0, #1
> >>>>>>>>>>> -       uqdecd  x2
> >>>>>>>>>>> -       ptrue   p5.b, all
> >>>>>>>>>>> -       whilelo p6.d, xzr, x2
> >>>>>>>>>>> +       mov     z29.s, w5
> >>>>>>>>>>> +       mov     z31.s, w2
> >>>>>>>>>>> +       index   z30.d, #0, #1
> >>>>>>>>>>> +       ptrue   p6.b, all
> >>>>>>>>>>> .p2align 3,,7
> >>>>>>>>>>> .L86:
> >>>>>>>>>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
> >>>>>>>>>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
> >>>>>>>>>>> -       movprfx z29, z30
> >>>>>>>>>>> -       mul     z29.s, p5/m, z29.s, z31.s
> >>>>>>>>>>> -       add     z28.d, z28.d, #1
> >>>>>>>>>>> -       uunpklo z26.d, z29.s
> >>>>>>>>>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
> >>>>>>>>>>> -       incw    x4
> >>>>>>>>>>> -       uunpkhi z29.d, z29.s
> >>>>>>>>>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
> >>>>>>>>>>> +       movprfx z28, z30
> >>>>>>>>>>> +       mul     z28.s, p6/m, z28.s, z31.s
> >>>>>>>>>>> add     z27.d, z27.d, #1
> >>>>>>>>>>> -       whilelo p6.d, x4, x2
> >>>>>>>>>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
> >>>>>>>>>>> -       incw    z30.s
> >>>>>>>>>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
> >>>>>>>>>>> +       incd    x4
> >>>>>>>>>>> +       add     z30.s, z30.s, z29.s
> >>>>>>>>>>> whilelo p7.d, x4, x3
> >>>>>>>>>>> b.any   .L86
> >>>>>>>>>>> .L84:
> >>>>>>>>>>> ret
> >>>>>>>>>>>
> >>>>>>>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
> >>>>>>>>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace 
> >>>>>>>>>>> machine and saw
> >>>>>>>>>>> no non-noise impact on performance. We would appreciate help with 
> >>>>>>>>>>> wider
> >>>>>>>>>>> benchmarking on other platforms, if necessary.
> >>>>>>>>>>> OK for mainline?
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
> >>>>>>>>>>>
> >>>>>>>>>>> gcc/
> >>>>>>>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove
> >>>>>>>>>>> use_new_vector_costs as tuning option.
> >>>>>>>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
> >>>>>>>>>>> Remove.
> >>>>>>>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of
> >>>>>>>>>>> aarch64_use_new_vector_costs_p and guard call to
> >>>>>>>>>>> vect_is_store_elt_extraction with count > 1.
> >>>>>>>>>>> (aarch64_vector_costs::finish_cost): Remove use of
> >>>>>>>>>>> aarch64_use_new_vector_costs_p.
> >>>>>>>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove
> >>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
> >>>>>>>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
> >>>>>>>>>>>
> >>>>>>>>>>> gcc/testsuite/
> >>>>>>>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected 
> >>>>>>>>>>> outcome.
> >>>>>>>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
> >>>>>>>>>>> ---
> >>>>>>>>>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
> >>>>>>>>>>> gcc/config/aarch64/aarch64.cc                 | 22 
> >>>>>>>>>>> +++++--------------
> >>>>>>>>>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
> >>>>>>>>>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
> >>>>>>>>>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
> >>>>>>>>>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
> >>>>>>>>>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
> >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
> >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
> >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
> >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
> >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
> >>>>>>>>>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
> >>>>>>>>>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
> >>>>>>>>>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
> >>>>>>>>>>> 15 files changed, 7 insertions(+), 32 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
> >>>>>>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>>>>>>> index 5939602576b..ed345b13ed3 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>>>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION 
> >>>>>>>>>>> ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
> >>>>>>>>>>>
> >>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", 
> >>>>>>>>>>> CSE_SVE_VL_CONSTANTS)
> >>>>>>>>>>>
> >>>>>>>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", 
> >>>>>>>>>>> USE_NEW_VECTOR_COSTS)
> >>>>>>>>>>> -
> >>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
> >>>>>>>>>>> MATCHED_VECTOR_THROUGHPUT)
> >>>>>>>>>>>
> >>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
> >>>>>>>>>>> AVOID_CROSS_LOOP_FMA)
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64.cc 
> >>>>>>>>>>> b/gcc/config/aarch64/aarch64.cc
> >>>>>>>>>>> index 43238aefef2..03806671c97 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/aarch64.cc
> >>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64.cc
> >>>>>>>>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info 
> >>>>>>>>>>> *vinfo, bool costing_for_scalar)
> >>>>>>>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> -/* Return true if the current CPU should use the new costs 
> >>>>>>>>>>> defined
> >>>>>>>>>>> -   in GCC 11.  This should be removed for GCC 12 and above, with 
> >>>>>>>>>>> the
> >>>>>>>>>>> -   costs applying to all CPUs instead.  */
> >>>>>>>>>>> -static bool
> >>>>>>>>>>> -aarch64_use_new_vector_costs_p ()
> >>>>>>>>>>> -{
> >>>>>>>>>>> -  return (aarch64_tune_params.extra_tuning_flags
> >>>>>>>>>>> -       & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
> >>>>>>>>>>> -}
> >>>>>>>>>>> -
> >>>>>>>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. 
> >>>>>>>>>>>  */
> >>>>>>>>>>> static const simd_vec_cost *
> >>>>>>>>>>> aarch64_simd_vec_costs (tree vectype)
> >>>>>>>>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int 
> >>>>>>>>>>> count, vect_cost_for_stmt kind,
> >>>>>>>>>>>
> >>>>>>>>>>> /* Do one-time initialization based on the vinfo.  */
> >>>>>>>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> >>>>>>>>>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
> >>>>>>>>>>> +  if (!m_analyzed_vinfo)
> >>>>>>>>>>> {
> >>>>>>>>>>> if (loop_vinfo)
> >>>>>>>>>>> analyze_loop_vinfo (loop_vinfo);
> >>>>>>>>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost 
> >>>>>>>>>>> (int count, vect_cost_for_stmt kind,
> >>>>>>>>>>>
> >>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
> >>>>>>>>>>> of just looking at KIND.  */
> >>>>>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>>>>>>>> +  if (stmt_info)
> >>>>>>>>>>> {
> >>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one
> >>>>>>>>>>> vec_to_scalar for each element.  However, we can store the first
> >>>>>>>>>>> element using an FP store without a separate extract step.  */
> >>>>>>>>>>> -      if (vect_is_store_elt_extraction (kind, stmt_info))
> >>>>>>>>>>> +      if (vect_is_store_elt_extraction (kind, stmt_info) && 
> >>>>>>>>>>> count > 1)
> >>>>>>>>>>> count -= 1;
> >>>>>>>>>>>
> >>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
> >>>>>>>>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int 
> >>>>>>>>>>> count, vect_cost_for_stmt kind,
> >>>>>>>>>>> else
> >>>>>>>>>>> m_num_last_promote_demote = 0;
> >>>>>>>>>>>
> >>>>>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>>>>>>>> +  if (stmt_info)
> >>>>>>>>>>> {
> >>>>>>>>>>> /* Account for any extra "embedded" costs that apply additively
> >>>>>>>>>>> to the base cost calculated above.  */
> >>>>>>>>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const 
> >>>>>>>>>>> vector_costs *uncast_scalar_costs)
> >>>>>>>>>>>
> >>>>>>>>>>> auto *scalar_costs
> >>>>>>>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
> >>>>>>>>>>> -  if (loop_vinfo
> >>>>>>>>>>> -      && m_vec_flags
> >>>>>>>>>>> -      && aarch64_use_new_vector_costs_p ())
> >>>>>>>>>>> +  if (loop_vinfo && m_vec_flags)
> >>>>>>>>>>> {
> >>>>>>>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> >>>>>>>>>>>                                    m_costs[vect_body]);
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>>>>>>> index eb9b89984b0..dafea96e924 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
> >>>>>>>>>>> cortexx925_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags. */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>>>>>>> index 6a098497759..ac001927959 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>>>>>>> @@ -55,7 +55,6 @@ static const struct tune_params 
> >>>>>>>>>>> fujitsu_monaka_tunings =
> >>>>>>>>>>> 0, /* max_case_values.  */
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  
> >>>>>>>>>>> */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>>>>>>> index 9b1cbfc5bd2..7b534831340 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>>>>>>> @@ -183,7 +183,6 @@ static const struct tune_params 
> >>>>>>>>>>> generic_armv8_a_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  
> >>>>>>>>>>> */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>>>>>>> index 48353a59939..562ef89c67b 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>>>>>>> @@ -249,7 +249,6 @@ static const struct tune_params 
> >>>>>>>>>>> generic_armv9_a_tunings =
> >>>>>>>>>>> 0, /* max_case_values.  */
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  
> >>>>>>>>>>> */
> >>>>>>>>>>> &generic_armv9a_prefetch_tune,
> >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>>>>>>> index c407b89a22f..fe4f7c10f73 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>>>>>>> @@ -156,7 +156,6 @@ static const struct tune_params 
> >>>>>>>>>>> neoverse512tvb_tunings =
> >>>>>>>>>>> 0, /* max_case_values.  */
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  
> >>>>>>>>>>> */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>>>>>>> index 18199ac206c..56be77423cb 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
> >>>>>>>>>>> neoversen2_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags. */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>>>>>>> index 4da85cfac0d..254ad5e27f8 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
> >>>>>>>>>>> neoversen3_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  
> >>>>>>>>>>> */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>>>>>>> index dd9120eee48..c7241cf23d7 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>>>>>>> @@ -227,7 +227,6 @@ static const struct tune_params 
> >>>>>>>>>>> neoversev1_tunings =
> >>>>>>>>>>> 0, /* max_case_values.  */
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags. */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>>>>>>> index 1369de73991..96f55940649 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>>>>>>> @@ -232,7 +232,6 @@ static const struct tune_params 
> >>>>>>>>>>> neoversev2_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),        /* tune_flags.  
> >>>>>>>>>>> */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>>>>>>> index d8c82255378..f62ae67d355 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
> >>>>>>>>>>> neoversev3_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags. */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>>>>>>> index 7f050501ede..0233baf5e34 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
> >>>>>>>>>>> neoversev3ae_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags. */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> diff --git 
> >>>>>>>>>>> a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
> >>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>>>>>>> index 762805ff54b..c334b7a6875 100644
> >>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>>>>>>> @@ -15,4 +15,4 @@
> >>>>>>>>>>> so we vectorize the offset calculation.  This means that the
> >>>>>>>>>>> 64-bit version needs two copies.  */
> >>>>>>>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, 
> >>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, 
> >>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, 
> >>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>>>>>>>>> diff --git 
> >>>>>>>>>>> a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
> >>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>>>>>>> index f0ea58e38e2..94cc63049bc 100644
> >>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>>>>>>> @@ -15,4 +15,4 @@
> >>>>>>>>>>> so we vectorize the offset calculation.  This means that the
> >>>>>>>>>>> 64-bit version needs two copies.  */
> >>>>>>>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
> >>>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, 
> >>>>>>>>>>> p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, 
> >>>>>>>>>>> p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Richard Biener <rguent...@suse.de>
> >>>>>>>>>> SUSE Software Solutions Germany GmbH,
> >>>>>>>>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
> >>>>>>>>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG 
> >>>>>>>>>> Nuernberg)
>
>
