> On 16 Dec 2024, at 09:10, Jennifer Schmitz <jschm...@nvidia.com> wrote:
> 
> 
> 
>> On 14 Dec 2024, at 09:32, Richard Biener <rguent...@suse.de> wrote:
>> 
>> 
>> 
>>>> On 13 Dec 2024, at 18:00, Jennifer Schmitz <jschm...@nvidia.com> wrote:
>>> 
>>> 
>>> 
>>>> On 13 Dec 2024, at 13:40, Richard Biener <richard.guent...@gmail.com> 
>>>> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Thu, Dec 12, 2024 at 5:27 PM Jennifer Schmitz <jschm...@nvidia.com> 
>>>>> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 6 Dec 2024, at 08:41, Jennifer Schmitz <jschm...@nvidia.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 5 Dec 2024, at 20:07, Richard Sandiford <richard.sandif...@arm.com> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>>>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford 
>>>>>>>>>>> <richard.sandif...@arm.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>>>>>>>> [...]
>>>>>>>>>>>> Looking at the diff of the vect dumps (below is a section of the 
>>>>>>>>>>>> diff for strided_store_2.c), it seemed odd that vec_to_scalar 
>>>>>>>>>>>> operations cost 0 now, instead of the previous cost of 2:
>>>>>>>>>>>> 
>>>>>>>>>>>> +strided_store_1.c:38:151: note:    === vectorizable_operation ===
>>>>>>>>>>>> +strided_store_1.c:38:151: note:    vect_model_simple_cost: 
>>>>>>>>>>>> inside_cost = 1, prologue_cost  = 0 .
>>>>>>>>>>>> +strided_store_1.c:38:151: note:   ==> examining statement: *_6 = 
>>>>>>>>>>>> _7;
>>>>>>>>>>>> +strided_store_1.c:38:151: note:   vect_is_simple_use: operand _3 
>>>>>>>>>>>> + 1.0e+0, type of def:    internal
>>>>>>>>>>>> +strided_store_1.c:38:151: note:   Vectorizing an unaligned access.
>>>>>>>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128
>>>>>>>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
>>>>>>>>>>>> +strided_store_1.c:38:151: note:   vect_model_store_cost: 
>>>>>>>>>>>> inside_cost = 12, prologue_cost = 0 .
>>>>>>>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body
>>>>>>>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
>>>>>>>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>>>> +<unknown> 1 times vector_load costs 1 in prologue
>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>>> 
>>>>>>>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in 
>>>>>>>>>>>> multiple places in aarch64.cc, the location that causes this 
>>>>>>>>>>>> behavior is this one:
>>>>>>>>>>>> unsigned
>>>>>>>>>>>> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt 
>>>>>>>>>>>> kind,
>>>>>>>>>>>>                              stmt_vec_info stmt_info, slp_tree,
>>>>>>>>>>>>                              tree vectype, int misalign,
>>>>>>>>>>>>                              vect_cost_model_location where)
>>>>>>>>>>>> {
>>>>>>>>>>>> [...]
>>>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>>>>>>> of just looking at KIND.  */
>>>>>>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>>>>>> +  if (stmt_info)
>>>>>>>>>>>> {
>>>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one
>>>>>>>>>>>>  vec_to_scalar for each element.  However, we can store the first
>>>>>>>>>>>>  element using an FP store without a separate extract step.  */
>>>>>>>>>>>> if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>>>>>>> count -= 1;
>>>>>>>>>>>> 
>>>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>>>>>>>>>                                               stmt_info, 
>>>>>>>>>>>> stmt_cost);
>>>>>>>>>>>> 
>>>>>>>>>>>> if (vectype && m_vec_flags)
>>>>>>>>>>>> stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>>>>>>>>>>>>                                                 stmt_info, vectype,
>>>>>>>>>>>>                                                 where, stmt_cost);
>>>>>>>>>>>> }
>>>>>>>>>>>> [...]
>>>>>>>>>>>> return record_stmt_cost (stmt_info, where, (count * 
>>>>>>>>>>>> stmt_cost).ceil ());
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> Previously, for mtune=generic, this function returned a cost of 2 
>>>>>>>>>>>> for a vec_to_scalar operation in the vect body. Now "if 
>>>>>>>>>>>> (stmt_info)" is entered and "if (vect_is_store_elt_extraction 
>>>>>>>>>>>> (kind, stmt_info))" evaluates to true, which sets the count to 0 
>>>>>>>>>>>> and leads to a return value of 0.
>>>>>>>>>>> 
>>>>>>>>>>> At the time the code was written, a scalarised store would be costed
>>>>>>>>>>> using one vec_to_scalar call into the backend, with the count 
>>>>>>>>>>> parameter
>>>>>>>>>>> set to the number of elements being stored.  The "count -= 1" was
>>>>>>>>>>> supposed to lop off the leading element extraction, since we can 
>>>>>>>>>>> store
>>>>>>>>>>> lane 0 as a normal FP store.
>>>>>>>>>>> 
>>>>>>>>>>> The target-independent costing was later reworked so that it costs
>>>>>>>>>>> each operation individually:
>>>>>>>>>>> 
>>>>>>>>>>>       for (i = 0; i < nstores; i++)
>>>>>>>>>>>         {
>>>>>>>>>>>           if (costing_p)
>>>>>>>>>>>             {
>>>>>>>>>>>               /* Only need vector extracting when there are more
>>>>>>>>>>>                  than one stores.  */
>>>>>>>>>>>               if (nstores > 1)
>>>>>>>>>>>                 inside_cost
>>>>>>>>>>>                   += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>>>>>>>>>>>                                        stmt_info, 0, vect_body);
>>>>>>>>>>>               /* Take a single lane vector type store as scalar
>>>>>>>>>>>                  store to avoid ICE like 110776.  */
>>>>>>>>>>>               if (VECTOR_TYPE_P (ltype)
>>>>>>>>>>>                   && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>>>>>>>>>                 n_adjacent_stores++;
>>>>>>>>>>>               else
>>>>>>>>>>>                 inside_cost
>>>>>>>>>>>                   += record_stmt_cost (cost_vec, 1, scalar_store,
>>>>>>>>>>>                                        stmt_info, 0, vect_body);
>>>>>>>>>>>               continue;
>>>>>>>>>>>             }
>>>>>>>>>>> 
>>>>>>>>>>> Unfortunately, there's no easy way of telling whether a particular 
>>>>>>>>>>> call
>>>>>>>>>>> is part of a group, and if so, which member of the group it is.
>>>>>>>>>>> 
>>>>>>>>>>> I suppose we could give up on the attempt to be (somewhat) accurate
>>>>>>>>>>> and just disable the optimisation.  Or we could restrict it to 
>>>>>>>>>>> count > 1,
>>>>>>>>>>> since it might still be useful for gathers and scatters.
>>>>>>>>>> I tried restricting the calls to vect_is_store_elt_extraction to 
>>>>>>>>>> count > 1 and it seems to resolve the issue of costing vec_to_scalar 
>>>>>>>>>> operations with 0 (see patch below).
>>>>>>>>>> What are your thoughts on this?
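>>>>>>>>>> Concretely, the guard is roughly the following (a sketch against the
>>>>>>>>>> add_stmt_cost excerpt above, not the final hunk):
>>>>>>>>>> 
>>>>>>>>>>   /* If we scalarize a strided store, the vectorizer costs one
>>>>>>>>>>      vec_to_scalar for each element.  However, we can store the first
>>>>>>>>>>      element using an FP store without a separate extract step.
>>>>>>>>>>      Only do this when the call covers the whole group,
>>>>>>>>>>      i.e. when count > 1.  */
>>>>>>>>>>   if (count > 1 && vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>>>>>     count -= 1;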
>>>>>>>>> 
>>>>>>>>> Why didn't you pursue instead moving the vec_to_scalar cost together
>>>>>>>>> with the n_adjacent_store handling?
>>>>>>>> When I continued working on this patch, we had already reached stage 3
>>>>>>>> and I was hesitant to introduce changes to the middle-end that were
>>>>>>>> not previously covered by this patch. So I tried whether the issue
>>>>>>>> could be resolved by making a small change in the backend instead.
>>>>>>>> If you still advise to use the n_adjacent_store instead, I’m happy to 
>>>>>>>> look into it again.
>>>>>>> 
>>>>>>> If Richard's ok with adjusting vectorizable_store for GCC 15 (which it
>>>>>>> sounds like he is), then I agree that would be better.  Otherwise we'd
>>>>>>> be creating technical debt to clean up for GCC 16.  And it is a 
>>>>>>> regression
>>>>>>> of sorts, so is stage 3 material from that POV.
>>>>>>> 
>>>>>>> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a
>>>>>>> "let's clean this up next stage 1" thing, since we needed to add tuning
>>>>>>> for a new CPU late during the cycle.  But of course, there were other
>>>>>>> priorities when stage 1 actually came around, so it never actually
>>>>>>> happened.  Thanks again for being the one to sort this out.)
>>>>>> Thanks for your feedback. Then I will try to make it work in 
>>>>>> vectorizable_store.
>>>>>> Best,
>>>>>> Jennifer
>>>>> Below is the updated patch with a suggestion for the changes in 
>>>>> vectorizable_store. It resolves the issue with the vec_to_scalar 
>>>>> operations that were individually costed with 0.
>>>>> We already tested it on aarch64, no regression, but we are still doing 
>>>>> performance testing.
>>>>> Can you give some feedback in the meantime on the patch itself?
>>>>> Thanks,
>>>>> Jennifer
>>>>> 
>>>>> 
>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>>> default. To that end, the function aarch64_use_new_vector_costs_p and its 
>>>>> uses
>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as
>>>>> described in
>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>>>> we adjusted vectorizable_store such that the variable n_adjacent_stores
>>>>> also covers vec_to_scalar operations. This way vec_to_scalar operations
>>>>> are not costed individually, but as a group.
>>>>> 
>>>>> Two tests were adjusted due to changes in codegen. In both cases, the
>>>>> old code performed loop unrolling once, but the new code does not:
>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>>> -moverride=tune=none):
>>>>> f_int64_t_32:
>>>>>     cbz     w3, .L92
>>>>>     mov     x4, 0
>>>>>     uxtw    x3, w3
>>>>> +       cntd    x5
>>>>> +       whilelo p7.d, xzr, x3
>>>>> +       mov     z29.s, w5
>>>>>     mov     z31.s, w2
>>>>> -       whilelo p6.d, xzr, x3
>>>>> -       mov     x2, x3
>>>>> -       index   z30.s, #0, #1
>>>>> -       uqdecd  x2
>>>>> -       ptrue   p5.b, all
>>>>> -       whilelo p7.d, xzr, x2
>>>>> +       index   z30.d, #0, #1
>>>>> +       ptrue   p6.b, all
>>>>>     .p2align 3,,7
>>>>> .L94:
>>>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
>>>>> -       ld1d    z28.d, p6/z, [x0]
>>>>> -       movprfx z29, z31
>>>>> -       mul     z29.s, p5/m, z29.s, z30.s
>>>>> -       incw    x4
>>>>> -       uunpklo z0.d, z29.s
>>>>> -       uunpkhi z29.d, z29.s
>>>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
>>>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
>>>>> -       add     z25.d, z28.d, z25.d
>>>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
>>>>> +       movprfx z28, z31
>>>>> +       mul     z28.s, p6/m, z28.s, z30.s
>>>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>>>>>     add     z26.d, z27.d, z26.d
>>>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
>>>>> -       whilelo p7.d, x4, x2
>>>>> -       st1d    z25.d, p6, [x0]
>>>>> -       incw    z30.s
>>>>> -       incb    x0, all, mul #2
>>>>> -       whilelo p6.d, x4, x3
>>>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
>>>>> +       add     z30.s, z30.s, z29.s
>>>>> +       incd    x4
>>>>> +       whilelo p7.d, x4, x3
>>>>>     b.any   .L94
>>>>> .L92:
>>>>>     ret
>>>>> 
>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>>> -moverride=tune=none):
>>>>> f_int64_t_32:
>>>>>     cbz     w3, .L84
>>>>> -       addvl   x5, x1, #1
>>>>>     mov     x4, 0
>>>>>     uxtw    x3, w3
>>>>> -       mov     z31.s, w2
>>>>> +       cntd    x5
>>>>>     whilelo p7.d, xzr, x3
>>>>> -       mov     x2, x3
>>>>> -       index   z30.s, #0, #1
>>>>> -       uqdecd  x2
>>>>> -       ptrue   p5.b, all
>>>>> -       whilelo p6.d, xzr, x2
>>>>> +       mov     z29.s, w5
>>>>> +       mov     z31.s, w2
>>>>> +       index   z30.d, #0, #1
>>>>> +       ptrue   p6.b, all
>>>>>     .p2align 3,,7
>>>>> .L86:
>>>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
>>>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
>>>>> -       movprfx z29, z30
>>>>> -       mul     z29.s, p5/m, z29.s, z31.s
>>>>> -       add     z28.d, z28.d, #1
>>>>> -       uunpklo z26.d, z29.s
>>>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
>>>>> -       incw    x4
>>>>> -       uunpkhi z29.d, z29.s
>>>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
>>>>> +       movprfx z28, z30
>>>>> +       mul     z28.s, p6/m, z28.s, z31.s
>>>>>     add     z27.d, z27.d, #1
>>>>> -       whilelo p6.d, x4, x2
>>>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
>>>>> -       incw    z30.s
>>>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
>>>>> +       incd    x4
>>>>> +       add     z30.s, z30.s, z29.s
>>>>>     whilelo p7.d, x4, x3
>>>>>     b.any   .L86
>>>>> .L84:
>>>>>     ret
>>>>> 
>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
>>>>> regression.
>>>>> OK for mainline?
>>>>> 
>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>>> 
>>>>> gcc/
>>>>>     * tree-vect-stmts.cc (vectorizable_store): Extend the use of
>>>>>     n_adjacent_stores to also cover vec_to_scalar operations.
>>>>>     * config/aarch64/aarch64-tuning-flags.def: Remove
>>>>>     use_new_vector_costs as tuning option.
>>>>>     * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>>>>     Remove.
>>>>>     (aarch64_vector_costs::add_stmt_cost): Remove use of
>>>>>     aarch64_use_new_vector_costs_p.
>>>>>     (aarch64_vector_costs::finish_cost): Remove use of
>>>>>     aarch64_use_new_vector_costs_p.
>>>>>     * config/aarch64/tuning_models/cortexx925.h: Remove
>>>>>     AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>>>>     * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>>>>     * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>>>>     * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>>>>     * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>>>>     * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>>>>     * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>>>>     * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>>>>     * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>>>>     * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>>>>     * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>>>> 
>>>>> gcc/testsuite/
>>>>>     * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>>>>     * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>>>> ---
>>>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
>>>>> gcc/config/aarch64/aarch64.cc                 | 20 +++----------
>>>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>>>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>>>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>>>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>>>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>>>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>>>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>>>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>>>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>>>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>>>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>>>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>>>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>>>>> gcc/tree-vect-stmts.cc                        | 29 ++++++++++---------
>>>>> 16 files changed, 22 insertions(+), 44 deletions(-)
>>>>> 
>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>> index ffbff20e29c..1de633c739b 100644
>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
>>>>> CHEAP_SHIFT_EXTEND)
>>>>> 
>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>>>>> 
>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", 
>>>>> USE_NEW_VECTOR_COSTS)
>>>>> -
>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
>>>>> MATCHED_VECTOR_THROUGHPUT)
>>>>> 
>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
>>>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>>>> index 77a2a6bfa3a..71fba9cc63b 100644
>>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, 
>>>>> bool costing_for_scalar)
>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>>>> }
>>>>> 
>>>>> -/* Return true if the current CPU should use the new costs defined
>>>>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
>>>>> -   costs applying to all CPUs instead.  */
>>>>> -static bool
>>>>> -aarch64_use_new_vector_costs_p ()
>>>>> -{
>>>>> -  return (aarch64_tune_params.extra_tuning_flags
>>>>> -         & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>>>> -}
>>>>> -
>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
>>>>> static const simd_vec_cost *
>>>>> aarch64_simd_vec_costs (tree vectype)
>>>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>>> vect_cost_for_stmt kind,
>>>>> 
>>>>> /* Do one-time initialization based on the vinfo.  */
>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>>>> +  if (!m_analyzed_vinfo)
>>>>>  {
>>>>>    if (loop_vinfo)
>>>>>     analyze_loop_vinfo (loop_vinfo);
>>>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>>> vect_cost_for_stmt kind,
>>>>> 
>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>   of just looking at KIND.  */
>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>> +  if (stmt_info)
>>>>>  {
>>>>>    /* If we scalarize a strided store, the vectorizer costs one
>>>>>      vec_to_scalar for each element.  However, we can store the first
>>>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>>> vect_cost_for_stmt kind,
>>>>> else
>>>>>  m_num_last_promote_demote = 0;
>>>>> 
>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>> +  if (stmt_info)
>>>>>  {
>>>>>    /* Account for any extra "embedded" costs that apply additively
>>>>>      to the base cost calculated above.  */
>>>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const 
>>>>> vector_costs *uncast_scalar_costs)
>>>>> 
>>>>> auto *scalar_costs
>>>>>  = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>>>> -  if (loop_vinfo
>>>>> -      && m_vec_flags
>>>>> -      && aarch64_use_new_vector_costs_p ())
>>>>> +  if (loop_vinfo && m_vec_flags)
>>>>>  {
>>>>>    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>>>                                          m_costs[vect_body]);
>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>> index b2ff716157a..0a8eff69307 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>> index 2d704ecd110..a564528f43d 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
>>>>> 0,   /* max_case_values.  */
>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>> index bdd309ab03d..f090d5cde50 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>> @@ -183,7 +183,6 @@ static const struct tune_params 
>>>>> generic_armv8_a_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>> index a05a9ab92a2..4c33c147444 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>> @@ -249,7 +249,6 @@ static const struct tune_params 
>>>>> generic_armv9_a_tunings =
>>>>> 0,   /* max_case_values.  */
>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>>> &generic_armv9a_prefetch_tune,
>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>> index c407b89a22f..fe4f7c10f73 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>> @@ -156,7 +156,6 @@ static const struct tune_params 
>>>>> neoverse512tvb_tunings =
>>>>> 0,   /* max_case_values.  */
>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>> index fd5f8f37370..0c74068da2c 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>> index 8b156c2fe4d..9d4e1be171a 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>> index 23c121d8652..85a78bb2bef 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>> index 40af5f47f4f..1dd452beb8d 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),  /* tune_flags.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>> index d65d74bfecf..d0ba5b1aef6 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>> index 7b7fa0b4b08..a1572048503 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>> index 762805ff54b..c334b7a6875 100644
>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>> @@ -15,4 +15,4 @@
>>>>> so we vectorize the offset calculation.  This means that the
>>>>> 64-bit version needs two copies.  */
>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>> index f0ea58e38e2..94cc63049bc 100644
>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>> @@ -15,4 +15,4 @@
>>>>> so we vectorize the offset calculation.  This means that the
>>>>> 64-bit version needs two copies.  */
>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>>>>> index be1139a423c..6d7d28c4702 100644
>>>>> --- a/gcc/tree-vect-stmts.cc
>>>>> +++ b/gcc/tree-vect-stmts.cc
>>>>> @@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo,
>>>>>             {
>>>>>               if (costing_p)
>>>>>                 {
>>>>> -                     /* Only need vector extracting when there are more
>>>>> -                        than one stores.  */
>>>>> -                     if (nstores > 1)
>>>>> -                       inside_cost
>>>>> -                         += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>>>>> -                                              stmt_info, slp_node,
>>>>> -                                              0, vect_body);
>>>>>                   /* Take a single lane vector type store as scalar
>>>>>                      store to avoid ICE like 110776.  */
>>>>> -                     if (VECTOR_TYPE_P (ltype)
>>>>> -                         && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>>> +                     bool single_lane_vec_p =
>>>>> +                       VECTOR_TYPE_P (ltype)
>>>>> +                       && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U);
>>>>> +                     /* Only need vector extracting when there are more
>>>>> +                        than one stores.  */
>>>>> +                     if (nstores > 1 || single_lane_vec_p)
>>>>>                     n_adjacent_stores++;
>>>>> -                     else
>>>>> +                     if (!single_lane_vec_p)
>>>> 
>>>> I think it's somewhat non-obvious that nstores > 1 and single_lane_vec_p
>>>> correlate.  In fact I think that we always record a store; just for
>>>> single-element vectors we record scalar stores.  I suggest to always
>>>> just do n_adjacent_stores++ here and below ...
>>>> 
>>>>>                     inside_cost
>>>>>                       += record_stmt_cost (cost_vec, 1, scalar_store,
>>>>>                                            stmt_info, 0, vect_body);
>>>>> @@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo,
>>>>>    if (costing_p)
>>>>>     {
>>>>>       if (n_adjacent_stores > 0)
>>>>> -           vect_get_store_cost (vinfo, stmt_info, slp_node, 
>>>>> n_adjacent_stores,
>>>>> -                                alignment_support_scheme, misalignment,
>>>>> -                                &inside_cost, cost_vec);
>>>>> +           {
>>>>> +             vect_get_store_cost (vinfo, stmt_info, slp_node, 
>>>>> n_adjacent_stores,
>>>>> +                                  alignment_support_scheme, misalignment,
>>>>> +                                  &inside_cost, cost_vec);
>>>> 
>>>> ... record n_adjacent_stores scalar_store when ltype is single-lane and
>>>> record n_adjacent_stores vec_to_scalar if nstores > 1 (and else none).
>>>> 
>>>> Richard.
>>> Thanks for the feedback, I’m glad it’s going in the right direction. Below 
>>> is the updated patch, re-validated on aarch64.
>>> Thanks, Jennifer
>>> 
>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>> default. To that end, the function aarch64_use_new_vector_costs_p and its 
>>> uses
>>> were removed. To prevent costing vec_to_scalar operations with 0, as
>>> described in
>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>> we adjusted vectorizable_store such that the variable n_adjacent_stores
>>> also covers vec_to_scalar operations. This way vec_to_scalar operations
>>> are not costed individually, but as a group.
>>> 
>>> Two tests were adjusted due to changes in codegen. In both cases, the
>>> old code performed loop unrolling once, but the new code does not:
>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>> -moverride=tune=none):
>>> f_int64_t_32:
>>>      cbz     w3, .L92
>>>      mov     x4, 0
>>>      uxtw    x3, w3
>>> +       cntd    x5
>>> +       whilelo p7.d, xzr, x3
>>> +       mov     z29.s, w5
>>>      mov     z31.s, w2
>>> -       whilelo p6.d, xzr, x3
>>> -       mov     x2, x3
>>> -       index   z30.s, #0, #1
>>> -       uqdecd  x2
>>> -       ptrue   p5.b, all
>>> -       whilelo p7.d, xzr, x2
>>> +       index   z30.d, #0, #1
>>> +       ptrue   p6.b, all
>>>      .p2align 3,,7
>>> .L94:
>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
>>> -       ld1d    z28.d, p6/z, [x0]
>>> -       movprfx z29, z31
>>> -       mul     z29.s, p5/m, z29.s, z30.s
>>> -       incw    x4
>>> -       uunpklo z0.d, z29.s
>>> -       uunpkhi z29.d, z29.s
>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
>>> -       add     z25.d, z28.d, z25.d
>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
>>> +       movprfx z28, z31
>>> +       mul     z28.s, p6/m, z28.s, z30.s
>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>>>      add     z26.d, z27.d, z26.d
>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
>>> -       whilelo p7.d, x4, x2
>>> -       st1d    z25.d, p6, [x0]
>>> -       incw    z30.s
>>> -       incb    x0, all, mul #2
>>> -       whilelo p6.d, x4, x3
>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
>>> +       add     z30.s, z30.s, z29.s
>>> +       incd    x4
>>> +       whilelo p7.d, x4, x3
>>>      b.any   .L94
>>> .L92:
>>>      ret
>>> 
>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>> -moverride=tune=none):
>>> f_int64_t_32:
>>>      cbz     w3, .L84
>>> -       addvl   x5, x1, #1
>>>      mov     x4, 0
>>>      uxtw    x3, w3
>>> -       mov     z31.s, w2
>>> +       cntd    x5
>>>      whilelo p7.d, xzr, x3
>>> -       mov     x2, x3
>>> -       index   z30.s, #0, #1
>>> -       uqdecd  x2
>>> -       ptrue   p5.b, all
>>> -       whilelo p6.d, xzr, x2
>>> +       mov     z29.s, w5
>>> +       mov     z31.s, w2
>>> +       index   z30.d, #0, #1
>>> +       ptrue   p6.b, all
>>>      .p2align 3,,7
>>> .L86:
>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
>>> -       movprfx z29, z30
>>> -       mul     z29.s, p5/m, z29.s, z31.s
>>> -       add     z28.d, z28.d, #1
>>> -       uunpklo z26.d, z29.s
>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
>>> -       incw    x4
>>> -       uunpkhi z29.d, z29.s
>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
>>> +       movprfx z28, z30
>>> +       mul     z28.s, p6/m, z28.s, z31.s
>>>      add     z27.d, z27.d, #1
>>> -       whilelo p6.d, x4, x2
>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
>>> -       incw    z30.s
>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
>>> +       incd    x4
>>> +       add     z30.s, z30.s, z29.s
>>>      whilelo p7.d, x4, x3
>>>      b.any   .L86
>>> .L84:
>>>  ret
>>> 
>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
>>> regression.
>>> OK for mainline?
>>> 
>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>> 
>>> gcc/
>>>  * tree-vect-stmts.cc (vectorizable_store): Extend the use of
>>>  n_adjacent_stores to also cover vec_to_scalar operations.
>>>  * config/aarch64/aarch64-tuning-flags.def: Remove
>>>  use_new_vector_costs as tuning option.
>>>  * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>>  Remove.
>>>  (aarch64_vector_costs::add_stmt_cost): Remove use of
>>>  aarch64_use_new_vector_costs_p.
>>>  (aarch64_vector_costs::finish_cost): Remove use of
>>>  aarch64_use_new_vector_costs_p.
>>>  * config/aarch64/tuning_models/cortexx925.h: Remove
>>>  AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>>  * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>>  * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>>  * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>>  * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>>  * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>>  * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>>  * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>>  * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>>  * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>>  * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>> 
>>> gcc/testsuite/
>>>  * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>>  * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>> ---
>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 -
>>> gcc/config/aarch64/aarch64.cc                 | 20 ++--------
>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>>> gcc/tree-vect-stmts.cc                        | 37 +++++++++++--------
>>> 16 files changed, 27 insertions(+), 47 deletions(-)
>>> 
>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>> index ffbff20e29c..1de633c739b 100644
>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
>>> CHEAP_SHIFT_EXTEND)
>>> 
>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>>> 
>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
>>> -
>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
>>> MATCHED_VECTOR_THROUGHPUT)
>>> 
>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>> index 77a2a6bfa3a..71fba9cc63b 100644
>>> --- a/gcc/config/aarch64/aarch64.cc
>>> +++ b/gcc/config/aarch64/aarch64.cc
>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, 
>>> bool costing_for_scalar)
>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>> }
>>> 
>>> -/* Return true if the current CPU should use the new costs defined
>>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
>>> -   costs applying to all CPUs instead.  */
>>> -static bool
>>> -aarch64_use_new_vector_costs_p ()
>>> -{
>>> -  return (aarch64_tune_params.extra_tuning_flags
>>> -      & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>> -}
>>> -
>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
>>> static const simd_vec_cost *
>>> aarch64_simd_vec_costs (tree vectype)
>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>> vect_cost_for_stmt kind,
>>> 
>>> /* Do one-time initialization based on the vinfo.  */
>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>> +  if (!m_analyzed_vinfo)
>>>   {
>>>     if (loop_vinfo)
>>>  analyze_loop_vinfo (loop_vinfo);
>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>> vect_cost_for_stmt kind,
>>> 
>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>    of just looking at KIND.  */
>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>> +  if (stmt_info)
>>>   {
>>>     /* If we scalarize a strided store, the vectorizer costs one
>>>   vec_to_scalar for each element.  However, we can store the first
>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>> vect_cost_for_stmt kind,
>>> else
>>>   m_num_last_promote_demote = 0;
>>> 
>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>> +  if (stmt_info)
>>>   {
>>>     /* Account for any extra "embedded" costs that apply additively
>>>   to the base cost calculated above.  */
>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const 
>>> vector_costs *uncast_scalar_costs)
>>> 
>>> auto *scalar_costs
>>>   = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>> -  if (loop_vinfo
>>> -      && m_vec_flags
>>> -      && aarch64_use_new_vector_costs_p ())
>>> +  if (loop_vinfo && m_vec_flags)
>>>   {
>>>     m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>                       m_costs[vect_body]);
>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>>> index 5ebaf66e986..74772f3e15f 100644
>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>> @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>> (AARCH64_EXTRA_TUNE_BASE
>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>> &generic_armv9a_prefetch_tune,
>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>> index 2d704ecd110..a564528f43d 100644
>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
>>> 0,    /* max_case_values.  */
>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>> (AARCH64_EXTRA_TUNE_BASE
>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>> &generic_prefetch_tune,
>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>> index bdd309ab03d..f090d5cde50 100644
>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>> @@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings 
>>> =
>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>> (AARCH64_EXTRA_TUNE_BASE
>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>> &generic_prefetch_tune,
>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>> index 785e00946bc..7b5821183bc 100644
>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>> @@ -251,7 +251,6 @@ static const struct tune_params generic_armv9_a_tunings 
>>> =
>>> 0,    /* max_case_values.  */
>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>> (AARCH64_EXTRA_TUNE_BASE
>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>> &generic_armv9a_prefetch_tune,
>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>> index 007f987154c..f7457df59e5 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings =
>>> 0,    /* max_case_values.  */
>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>> &generic_armv9a_prefetch_tune,
>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>>> index 32560d2f5f8..541b61c8179 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>> (AARCH64_EXTRA_TUNE_BASE
>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>> &generic_armv9a_prefetch_tune,
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>>> index 2010bc4645b..eff668132a8 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>> (AARCH64_EXTRA_TUNE_BASE
>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>> &generic_armv9a_prefetch_tune,
>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>>> index c3751e32696..d11472b6e1e 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>> (AARCH64_EXTRA_TUNE_BASE
>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>> &generic_armv9a_prefetch_tune,
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>> index 80dbe5c806c..ee77ffdd3bc 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>> (AARCH64_EXTRA_TUNE_BASE
>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>>  | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),    /* tune_flags.  */
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>>> index efe09e16d1e..6ef143ef7d5 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>> (AARCH64_EXTRA_TUNE_BASE
>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>> &generic_armv9a_prefetch_tune,
>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>> index 66849f30889..96bdbf971f1 100644
>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>> (AARCH64_EXTRA_TUNE_BASE
>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>> &generic_armv9a_prefetch_tune,
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>> index 762805ff54b..c334b7a6875 100644
>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>> @@ -15,4 +15,4 @@
>>>  so we vectorize the offset calculation.  This means that the
>>>  64-bit version needs two copies.  */
>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>> index f0ea58e38e2..94cc63049bc 100644
>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>> @@ -15,4 +15,4 @@
>>>  so we vectorize the offset calculation.  This means that the
>>>  64-bit version needs two copies.  */
>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>>> index be1139a423c..ab57163c243 100644
>>> --- a/gcc/tree-vect-stmts.cc
>>> +++ b/gcc/tree-vect-stmts.cc
>>> @@ -8834,19 +8834,8 @@ vectorizable_store (vec_info *vinfo,
>>>      {
>>>        if (costing_p)
>>>          {
>>> -              /* Only need vector extracting when there are more
>>> -             than one stores.  */
>>> -              if (nstores > 1)
>>> -            inside_cost
>>> -              += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>>> -                           stmt_info, slp_node,
>>> -                           0, vect_body);
>>> -              /* Take a single lane vector type store as scalar
>>> -             store to avoid ICE like 110776.  */
>>> -              if (VECTOR_TYPE_P (ltype)
>>> -              && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>> -            n_adjacent_stores++;
>>> -              else
>>> +              n_adjacent_stores++;
>>> +              if (!VECTOR_TYPE_P (ltype))
>> 
>> This should be combined with the single-lane vector case below.
>> 
>>>          inside_cost
>>>            += record_stmt_cost (cost_vec, 1, scalar_store,
>>>                         stmt_info, 0, vect_body);
>>> @@ -8905,9 +8894,25 @@ vectorizable_store (vec_info *vinfo,
>>>     if (costing_p)
>>>  {
>>>    if (n_adjacent_stores > 0)
>>> -        vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
>>> -                 alignment_support_scheme, misalignment,
>>> -                 &inside_cost, cost_vec);
>>> +        {
>>> +          /* Take a single lane vector type store as scalar
>>> +         store to avoid ICE like 110776.  */
>>> +          if (VECTOR_TYPE_P (ltype)
>>> +          && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>> +        inside_cost
>>> +          += record_stmt_cost (cost_vec, n_adjacent_stores,
>>> +                       scalar_store, stmt_info, 0, vect_body);
>>> +          /* Only need vector extracting when there are more
>>> +         than one stores.  */
>>> +          if (nstores > 1)
>>> +        inside_cost
>>> +          += record_stmt_cost (cost_vec, n_adjacent_stores,
>>> +                       vec_to_scalar, stmt_info, slp_node,
>>> +                       0, vect_body);
>>> +          vect_get_store_cost (vinfo, stmt_info, slp_node,
>> 
>> This should only be done for multi-lane vectors.
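>> 
>> I.e. roughly (a sketch only, reusing the names from your patch):
>> 
>>           if (n_adjacent_stores > 0)
>>             {
>>               /* Take a single lane vector type store as scalar
>>                  store to avoid ICE like 110776.  */
>>               if (!VECTOR_TYPE_P (ltype)
>>                   || known_eq (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>                 inside_cost
>>                   += record_stmt_cost (cost_vec, n_adjacent_stores,
>>                                        scalar_store, stmt_info, 0,
>>                                        vect_body);
>>               else
>>                 vect_get_store_cost (vinfo, stmt_info, slp_node,
>>                                      n_adjacent_stores,
>>                                      alignment_support_scheme,
>>                                      misalignment, &inside_cost, cost_vec);
>>               /* Only need vector extracting when there are more
>>                  than one stores.  */
>>               if (nstores > 1)
>>                 inside_cost
>>                   += record_stmt_cost (cost_vec, n_adjacent_stores,
>>                                        vec_to_scalar, stmt_info, slp_node,
>>                                        0, vect_body);
>>             }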
> Thanks for the quick reply. As I am making the changes, I am wondering: Do we 
> even need n_adjacent_stores anymore? It appears to always have the same value 
> as nstores. Can we remove it and use nstores instead or does it still serve 
> another purpose?

It was a heuristic needed for powerpc(?). Can you confirm we’re not combining
stores from VF unrolling for strided SLP stores?

> Thanks, Jennifer
>> 
>>> +                   n_adjacent_stores, alignment_support_scheme,
>>> +                   misalignment, &inside_cost, cost_vec);
>>> +        }
>>>    if (dump_enabled_p ())
>>>      dump_printf_loc (MSG_NOTE, vect_location,
>>>               "vect_model_store_cost: inside_cost = %d, "
>>> --
>>> 2.34.1
>>>> 
>>>>> +             inside_cost
>>>>> +               += record_stmt_cost (cost_vec, n_adjacent_stores, 
>>>>> vec_to_scalar,
>>>>> +                                    stmt_info, slp_node,
>>>>> +                                    0, vect_body);
>>>>> +           }
>>>>>       if (dump_enabled_p ())
>>>>>         dump_printf_loc (MSG_NOTE, vect_location,
>>>>>                          "vect_model_store_cost: inside_cost = %d, "
>>>>> --
>>>>> 2.44.0
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>> Richard
>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Jennifer
>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Jennifer
>>>>>>>>>> 
>>>>>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS 
>>>>>>>>>> tunable and
>>>>>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>>>>>>>> default. To that end, the function aarch64_use_new_vector_costs_p 
>>>>>>>>>> and its uses
>>>>>>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as
>>>>>>>>>> described in
>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>>>>>>>>> we guarded the call to vect_is_store_elt_extraction in
>>>>>>>>>> aarch64_vector_costs::add_stmt_cost by count > 1.
>>>>>>>>>> 
>>>>>>>>>> Two tests were adjusted due to changes in codegen. In both cases, the
>>>>>>>>>> old code performed loop unrolling once, but the new code does not:
>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>>>>>>>> -moverride=tune=none):
>>>>>>>>>> f_int64_t_32:
>>>>>>>>>>  cbz     w3, .L92
>>>>>>>>>>  mov     x4, 0
>>>>>>>>>>  uxtw    x3, w3
>>>>>>>>>> +       cntd    x5
>>>>>>>>>> +       whilelo p7.d, xzr, x3
>>>>>>>>>> +       mov     z29.s, w5
>>>>>>>>>>  mov     z31.s, w2
>>>>>>>>>> -       whilelo p6.d, xzr, x3
>>>>>>>>>> -       mov     x2, x3
>>>>>>>>>> -       index   z30.s, #0, #1
>>>>>>>>>> -       uqdecd  x2
>>>>>>>>>> -       ptrue   p5.b, all
>>>>>>>>>> -       whilelo p7.d, xzr, x2
>>>>>>>>>> +       index   z30.d, #0, #1
>>>>>>>>>> +       ptrue   p6.b, all
>>>>>>>>>>  .p2align 3,,7
>>>>>>>>>> .L94:
>>>>>>>>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
>>>>>>>>>> -       ld1d    z28.d, p6/z, [x0]
>>>>>>>>>> -       movprfx z29, z31
>>>>>>>>>> -       mul     z29.s, p5/m, z29.s, z30.s
>>>>>>>>>> -       incw    x4
>>>>>>>>>> -       uunpklo z0.d, z29.s
>>>>>>>>>> -       uunpkhi z29.d, z29.s
>>>>>>>>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
>>>>>>>>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
>>>>>>>>>> -       add     z25.d, z28.d, z25.d
>>>>>>>>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
>>>>>>>>>> +       movprfx z28, z31
>>>>>>>>>> +       mul     z28.s, p6/m, z28.s, z30.s
>>>>>>>>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>>>>>>>>>>  add     z26.d, z27.d, z26.d
>>>>>>>>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
>>>>>>>>>> -       whilelo p7.d, x4, x2
>>>>>>>>>> -       st1d    z25.d, p6, [x0]
>>>>>>>>>> -       incw    z30.s
>>>>>>>>>> -       incb    x0, all, mul #2
>>>>>>>>>> -       whilelo p6.d, x4, x3
>>>>>>>>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
>>>>>>>>>> +       add     z30.s, z30.s, z29.s
>>>>>>>>>> +       incd    x4
>>>>>>>>>> +       whilelo p7.d, x4, x3
>>>>>>>>>>  b.any   .L94
>>>>>>>>>> .L92:
>>>>>>>>>>  ret
>>>>>>>>>> 
>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>>>>>>>> -moverride=tune=none):
>>>>>>>>>> f_int64_t_32:
>>>>>>>>>>  cbz     w3, .L84
>>>>>>>>>> -       addvl   x5, x1, #1
>>>>>>>>>>  mov     x4, 0
>>>>>>>>>>  uxtw    x3, w3
>>>>>>>>>> -       mov     z31.s, w2
>>>>>>>>>> +       cntd    x5
>>>>>>>>>>  whilelo p7.d, xzr, x3
>>>>>>>>>> -       mov     x2, x3
>>>>>>>>>> -       index   z30.s, #0, #1
>>>>>>>>>> -       uqdecd  x2
>>>>>>>>>> -       ptrue   p5.b, all
>>>>>>>>>> -       whilelo p6.d, xzr, x2
>>>>>>>>>> +       mov     z29.s, w5
>>>>>>>>>> +       mov     z31.s, w2
>>>>>>>>>> +       index   z30.d, #0, #1
>>>>>>>>>> +       ptrue   p6.b, all
>>>>>>>>>>  .p2align 3,,7
>>>>>>>>>> .L86:
>>>>>>>>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
>>>>>>>>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
>>>>>>>>>> -       movprfx z29, z30
>>>>>>>>>> -       mul     z29.s, p5/m, z29.s, z31.s
>>>>>>>>>> -       add     z28.d, z28.d, #1
>>>>>>>>>> -       uunpklo z26.d, z29.s
>>>>>>>>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
>>>>>>>>>> -       incw    x4
>>>>>>>>>> -       uunpkhi z29.d, z29.s
>>>>>>>>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
>>>>>>>>>> +       movprfx z28, z30
>>>>>>>>>> +       mul     z28.s, p6/m, z28.s, z31.s
>>>>>>>>>>  add     z27.d, z27.d, #1
>>>>>>>>>> -       whilelo p6.d, x4, x2
>>>>>>>>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
>>>>>>>>>> -       incw    z30.s
>>>>>>>>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
>>>>>>>>>> +       incd    x4
>>>>>>>>>> +       add     z30.s, z30.s, z29.s
>>>>>>>>>>  whilelo p7.d, x4, x3
>>>>>>>>>>  b.any   .L86
>>>>>>>>>> .L84:
>>>>>>>>>> ret
>>>>>>>>>> 
>>>>>>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu with no
>>>>>>>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace machine
>>>>>>>>>> and saw no performance impact beyond noise. We would appreciate help
>>>>>>>>>> with wider benchmarking on other platforms, if necessary.
>>>>>>>>>> OK for mainline?
>>>>>>>>>> 
>>>>>>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>>>>>>>> 
>>>>>>>>>> gcc/
>>>>>>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove
>>>>>>>>>> use_new_vector_costs as tuning option.
>>>>>>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>>>>>>>>> Remove.
>>>>>>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of
>>>>>>>>>> aarch64_use_new_vector_costs_p and guard call to
>>>>>>>>>> vect_is_store_elt_extraction with count > 1.
>>>>>>>>>> (aarch64_vector_costs::finish_cost): Remove use of
>>>>>>>>>> aarch64_use_new_vector_costs_p.
>>>>>>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove
>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>>>>>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>>>>>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>>>>>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>>>>>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>>>>>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>>>>>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>>>>>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>>>>>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>>>>>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>>>>>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>>>>>>>>> 
>>>>>>>>>> gcc/testsuite/
>>>>>>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>>>>>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>>>>>>>>> ---
>>>>>>>>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
>>>>>>>>>> gcc/config/aarch64/aarch64.cc                 | 22 
>>>>>>>>>> +++++--------------
>>>>>>>>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>>>>>>>>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>>>>>>>>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>>>>>>>>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>>>>>>>>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>>>>>>>>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>>>>>>>>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>>>>>>>>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>>>>>>>>>> 15 files changed, 7 insertions(+), 32 deletions(-)
>>>>>>>>>> 
>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>>>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>>> index 5939602576b..ed345b13ed3 100644
>>>>>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
>>>>>>>>>> CHEAP_SHIFT_EXTEND)
>>>>>>>>>> 
>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", 
>>>>>>>>>> CSE_SVE_VL_CONSTANTS)
>>>>>>>>>> 
>>>>>>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", 
>>>>>>>>>> USE_NEW_VECTOR_COSTS)
>>>>>>>>>> -
>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
>>>>>>>>>> MATCHED_VECTOR_THROUGHPUT)
>>>>>>>>>> 
>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
>>>>>>>>>> AVOID_CROSS_LOOP_FMA)
>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64.cc 
>>>>>>>>>> b/gcc/config/aarch64/aarch64.cc
>>>>>>>>>> index 43238aefef2..03806671c97 100644
>>>>>>>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>>>>>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info 
>>>>>>>>>> *vinfo, bool costing_for_scalar)
>>>>>>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> -/* Return true if the current CPU should use the new costs defined
>>>>>>>>>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
>>>>>>>>>> -   costs applying to all CPUs instead.  */
>>>>>>>>>> -static bool
>>>>>>>>>> -aarch64_use_new_vector_costs_p ()
>>>>>>>>>> -{
>>>>>>>>>> -  return (aarch64_tune_params.extra_tuning_flags
>>>>>>>>>> -       & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>>>>>>>>> -}
>>>>>>>>>> -
>>>>>>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
>>>>>>>>>> static const simd_vec_cost *
>>>>>>>>>> aarch64_simd_vec_costs (tree vectype)
>>>>>>>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int 
>>>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>>> 
>>>>>>>>>> /* Do one-time initialization based on the vinfo.  */
>>>>>>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>>>>>>>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>>>>>>>>> +  if (!m_analyzed_vinfo)
>>>>>>>>>> {
>>>>>>>>>> if (loop_vinfo)
>>>>>>>>>> analyze_loop_vinfo (loop_vinfo);
>>>>>>>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost (int 
>>>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>>> 
>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>>>>> of just looking at KIND.  */
>>>>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>>>> +  if (stmt_info)
>>>>>>>>>> {
>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one
>>>>>>>>>> vec_to_scalar for each element.  However, we can store the first
>>>>>>>>>> element using an FP store without a separate extract step.  */
>>>>>>>>>> -      if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>>>>> +      if (vect_is_store_elt_extraction (kind, stmt_info) && count > 
>>>>>>>>>> 1)
>>>>>>>>>> count -= 1;
>>>>>>>>>> 
>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>>>>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int 
>>>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>>> else
>>>>>>>>>> m_num_last_promote_demote = 0;
>>>>>>>>>> 
>>>>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>>>> +  if (stmt_info)
>>>>>>>>>> {
>>>>>>>>>> /* Account for any extra "embedded" costs that apply additively
>>>>>>>>>> to the base cost calculated above.  */
>>>>>>>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const 
>>>>>>>>>> vector_costs *uncast_scalar_costs)
>>>>>>>>>> 
>>>>>>>>>> auto *scalar_costs
>>>>>>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>>>>>>>>> -  if (loop_vinfo
>>>>>>>>>> -      && m_vec_flags
>>>>>>>>>> -      && aarch64_use_new_vector_costs_p ())
>>>>>>>>>> +  if (loop_vinfo && m_vec_flags)
>>>>>>>>>> {
>>>>>>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>>>>>>>>                                     m_costs[vect_body]);
>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
>>>>>>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>>> index eb9b89984b0..dafea96e924 100644
>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>>> cortexx925_tunings =
>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
>>>>>>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>>> index 6a098497759..ac001927959 100644
>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>>> @@ -55,7 +55,6 @@ static const struct tune_params 
>>>>>>>>>> fujitsu_monaka_tunings =
>>>>>>>>>> 0, /* max_case_values.  */
>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>>> index 9b1cbfc5bd2..7b534831340 100644
>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>>> @@ -183,7 +183,6 @@ static const struct tune_params 
>>>>>>>>>> generic_armv8_a_tunings =
>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>>> index 48353a59939..562ef89c67b 100644
>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>>> @@ -249,7 +249,6 @@ static const struct tune_params 
>>>>>>>>>> generic_armv9_a_tunings =
>>>>>>>>>> 0, /* max_case_values.  */
>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>>> &generic_armv9a_prefetch_tune,
>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>>> index c407b89a22f..fe4f7c10f73 100644
>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>>> @@ -156,7 +156,6 @@ static const struct tune_params 
>>>>>>>>>> neoverse512tvb_tunings =
>>>>>>>>>> 0, /* max_case_values.  */
>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>>> index 18199ac206c..56be77423cb 100644
>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>>> neoversen2_tunings =
>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>>> index 4da85cfac0d..254ad5e27f8 100644
>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>>> neoversen3_tunings =
>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>>> index dd9120eee48..c7241cf23d7 100644
>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>>> @@ -227,7 +227,6 @@ static const struct tune_params 
>>>>>>>>>> neoversev1_tunings =
>>>>>>>>>> 0, /* max_case_values.  */
>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>> | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>>> index 1369de73991..96f55940649 100644
>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>>> @@ -232,7 +232,6 @@ static const struct tune_params 
>>>>>>>>>> neoversev2_tunings =
>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>>>>>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),        /* tune_flags.  */
>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>>> index d8c82255378..f62ae67d355 100644
>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>>> neoversev3_tunings =
>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>>> index 7f050501ede..0233baf5e34 100644
>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>>> neoversev3ae_tunings =
>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>>> index 762805ff54b..c334b7a6875 100644
>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>>> @@ -15,4 +15,4 @@
>>>>>>>>>> so we vectorize the offset calculation.  This means that the
>>>>>>>>>> 64-bit version needs two copies.  */
>>>>>>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
>>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>>> index f0ea58e38e2..94cc63049bc 100644
>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>>> @@ -15,4 +15,4 @@
>>>>>>>>>> so we vectorize the offset calculation.  This means that the
>>>>>>>>>> 64-bit version needs two copies.  */
>>>>>>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
>>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Richard Biener <rguent...@suse.de>
>>>>>>>>> SUSE Software Solutions Germany GmbH,
>>>>>>>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>>>>>>>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG 
>>>>>>>>> Nuernberg)
> 
> 
