> On 14 Dec 2024, at 09:32, Richard Biener <rguent...@suse.de> wrote:
> 
>> Am 13.12.2024 um 18:00 schrieb Jennifer Schmitz <jschm...@nvidia.com>:
>> 
>> 
>> 
>>> On 13 Dec 2024, at 13:40, Richard Biener <richard.guent...@gmail.com> wrote:
>>> 
>>>> On Thu, Dec 12, 2024 at 5:27 PM Jennifer Schmitz <jschm...@nvidia.com> 
>>>> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On 6 Dec 2024, at 08:41, Jennifer Schmitz <jschm...@nvidia.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 5 Dec 2024, at 20:07, Richard Sandiford <richard.sandif...@arm.com> 
>>>>>> wrote:
>>>>>> 
>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote:
>>>>>>>> 
>>>>>>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford 
>>>>>>>>>> <richard.sandif...@arm.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>>>>>>> [...]
>>>>>>>>>>> Looking at the diff of the vect dumps (below is a section of the 
>>>>>>>>>>> diff for strided_store_2.c), it seemed odd that vec_to_scalar 
>>>>>>>>>>> operations cost 0 now, instead of the previous cost of 2:
>>>>>>>>>>> 
>>>>>>>>>>> +strided_store_1.c:38:151: note:    === vectorizable_operation ===
>>>>>>>>>>> +strided_store_1.c:38:151: note:    vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
>>>>>>>>>>> +strided_store_1.c:38:151: note:   ==> examining statement: *_6 = _7;
>>>>>>>>>>> +strided_store_1.c:38:151: note:   vect_is_simple_use: operand _3 + 1.0e+0, type of def:    internal
>>>>>>>>>>> +strided_store_1.c:38:151: note:   Vectorizing an unaligned access.
>>>>>>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128
>>>>>>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
>>>>>>>>>>> +strided_store_1.c:38:151: note:   vect_model_store_cost: inside_cost = 12, prologue_cost = 0 .
>>>>>>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body
>>>>>>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
>>>>>>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>>> +<unknown> 1 times vector_load costs 1 in prologue
>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>>>>>> 
>>>>>>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in 
>>>>>>>>>>> multiple places in aarch64.cc, the location that causes this 
>>>>>>>>>>> behavior is this one:
>>>>>>>>>>> unsigned
>>>>>>>>>>> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>>>>>>>>>>                                      stmt_vec_info stmt_info, slp_tree,
>>>>>>>>>>>                                      tree vectype, int misalign,
>>>>>>>>>>>                                      vect_cost_model_location where)
>>>>>>>>>>> {
>>>>>>>>>>> [...]
>>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>>>>>> of just looking at KIND.  */
>>>>>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>>>>> +  if (stmt_info)
>>>>>>>>>>> {
>>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one
>>>>>>>>>>>   vec_to_scalar for each element.  However, we can store the first
>>>>>>>>>>>   element using an FP store without a separate extract step.  */
>>>>>>>>>>> if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>>>>>>  count -= 1;
>>>>>>>>>>> 
>>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>>>>>>>>                                                 stmt_info, stmt_cost);
>>>>>>>>>>> 
>>>>>>>>>>> if (vectype && m_vec_flags)
>>>>>>>>>>>  stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>>>>>>>>>>>                                                  stmt_info, vectype,
>>>>>>>>>>>                                                  where, stmt_cost);
>>>>>>>>>>> }
>>>>>>>>>>> [...]
>>>>>>>>>>> return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> Previously, for mtune=generic, this function returned a cost of 2 for a
>>>>>>>>>>> vec_to_scalar operation in the vect body.  Now "if (stmt_info)" is
>>>>>>>>>>> entered and "if (vect_is_store_elt_extraction (kind, stmt_info))"
>>>>>>>>>>> evaluates to true, which sets the count to 0 and leads to a return
>>>>>>>>>>> value of 0.
>>>>>>>>>> 
>>>>>>>>>> At the time the code was written, a scalarised store would be costed
>>>>>>>>>> using one vec_to_scalar call into the backend, with the count parameter
>>>>>>>>>> set to the number of elements being stored.  The "count -= 1" was
>>>>>>>>>> supposed to lop off the leading element extraction, since we can store
>>>>>>>>>> lane 0 as a normal FP store.
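>>>>>>>>>> 
>>>>>>>>>> (To make the arithmetic concrete, using the four-element dump above:
>>>>>>>>>> when the "count -= 1" code was written, the backend would have seen a
>>>>>>>>>> single call with count = 4 and returned (4 - 1) * 2 = 6, assuming the
>>>>>>>>>> per-element cost of 2 seen in the dump.  After the rework it sees four
>>>>>>>>>> calls with count = 1 each, the subtraction fires every time, and each
>>>>>>>>>> call returns 0, matching the "vec_to_scalar costs 0 in body" lines
>>>>>>>>>> above.)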
>>>>>>>>>> 
>>>>>>>>>> The target-independent costing was later reworked so that it costs
>>>>>>>>>> each operation individually:
>>>>>>>>>> 
>>>>>>>>>>        for (i = 0; i < nstores; i++)
>>>>>>>>>>          {
>>>>>>>>>>            if (costing_p)
>>>>>>>>>>              {
>>>>>>>>>>                /* Only need vector extracting when there are more
>>>>>>>>>>                   than one stores.  */
>>>>>>>>>>                if (nstores > 1)
>>>>>>>>>>                  inside_cost
>>>>>>>>>>                    += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>>>>>>>>>>                                         stmt_info, 0, vect_body);
>>>>>>>>>>                /* Take a single lane vector type store as scalar
>>>>>>>>>>                   store to avoid ICE like 110776.  */
>>>>>>>>>>                if (VECTOR_TYPE_P (ltype)
>>>>>>>>>>                    && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>>>>>>>>                  n_adjacent_stores++;
>>>>>>>>>>                else
>>>>>>>>>>                  inside_cost
>>>>>>>>>>                    += record_stmt_cost (cost_vec, 1, scalar_store,
>>>>>>>>>>                                         stmt_info, 0, vect_body);
>>>>>>>>>>                continue;
>>>>>>>>>>              }
>>>>>>>>>> 
>>>>>>>>>> Unfortunately, there's no easy way of telling whether a particular call
>>>>>>>>>> is part of a group, and if so, which member of the group it is.
>>>>>>>>>> 
>>>>>>>>>> I suppose we could give up on the attempt to be (somewhat) accurate
>>>>>>>>>> and just disable the optimisation.  Or we could restrict it to
>>>>>>>>>> count > 1, since it might still be useful for gathers and scatters.
>>>>>>>>> I tried restricting the calls to vect_is_store_elt_extraction to
>>>>>>>>> count > 1 and it seems to resolve the issue of vec_to_scalar
>>>>>>>>> operations being costed at 0 (see patch below).
>>>>>>>>> What are your thoughts on this?
>>>>>>>> 
>>>>>>>> Why didn't you instead pursue moving the vec_to_scalar cost together
>>>>>>>> with the n_adjacent_stores handling?
>>>>>>> When I continued working on this patch, we had already reached stage 3,
>>>>>>> and I was hesitant to introduce changes to the middle-end that were not
>>>>>>> previously covered by this patch.  So I tried to see whether the issue
>>>>>>> could be resolved by a small change in the backend instead.
>>>>>>> If you still advise using n_adjacent_stores instead, I’m happy to
>>>>>>> look into it again.
>>>>>> 
>>>>>> If Richard's ok with adjusting vectorizable_store for GCC 15 (which it
>>>>>> sounds like he is), then I agree that would be better.  Otherwise we'd
>>>>>> be creating technical debt to clean up for GCC 16.  And it is a 
>>>>>> regression
>>>>>> of sorts, so is stage 3 material from that POV.
>>>>>> 
>>>>>> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a
>>>>>> "let's clean this up next stage 1" thing, since we needed to add tuning
>>>>>> for a new CPU late during the cycle.  But of course, there were other
>>>>>> priorities when stage 1 actually came around, so it never actually
>>>>>> happened.  Thanks again for being the one to sort this out.)
>>>>> Thanks for your feedback. Then I will try to make it work in 
>>>>> vectorizable_store.
>>>>> Best,
>>>>> Jennifer
>>>> Below is the updated patch with a suggested change to vectorizable_store.
>>>> It resolves the issue of the vec_to_scalar operations that were
>>>> individually costed at 0.
>>>> We already tested it on aarch64 with no regression, but we are still
>>>> doing performance testing.
>>>> Could you give some feedback on the patch itself in the meantime?
>>>> Thanks,
>>>> Jennifer
>>>> 
>>>> 
>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
>>>> the use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>> default.  To that end, the function aarch64_use_new_vector_costs_p and
>>>> its uses were removed.  To prevent costing vec_to_scalar operations at 0,
>>>> as described in
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>>> we adjusted vectorizable_store such that the variable n_adjacent_stores
>>>> also covers vec_to_scalar operations.  This way, vec_to_scalar operations
>>>> are not costed individually, but as a group.
>>>> 
>>>> Two tests were adjusted due to changes in codegen. In both cases, the
>>>> old code performed loop unrolling once, but the new code does not:
>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>> -moverride=tune=none):
>>>> f_int64_t_32:
>>>>      cbz     w3, .L92
>>>>      mov     x4, 0
>>>>      uxtw    x3, w3
>>>> +       cntd    x5
>>>> +       whilelo p7.d, xzr, x3
>>>> +       mov     z29.s, w5
>>>>      mov     z31.s, w2
>>>> -       whilelo p6.d, xzr, x3
>>>> -       mov     x2, x3
>>>> -       index   z30.s, #0, #1
>>>> -       uqdecd  x2
>>>> -       ptrue   p5.b, all
>>>> -       whilelo p7.d, xzr, x2
>>>> +       index   z30.d, #0, #1
>>>> +       ptrue   p6.b, all
>>>>      .p2align 3,,7
>>>> .L94:
>>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
>>>> -       ld1d    z28.d, p6/z, [x0]
>>>> -       movprfx z29, z31
>>>> -       mul     z29.s, p5/m, z29.s, z30.s
>>>> -       incw    x4
>>>> -       uunpklo z0.d, z29.s
>>>> -       uunpkhi z29.d, z29.s
>>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
>>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
>>>> -       add     z25.d, z28.d, z25.d
>>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
>>>> +       movprfx z28, z31
>>>> +       mul     z28.s, p6/m, z28.s, z30.s
>>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>>>>      add     z26.d, z27.d, z26.d
>>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
>>>> -       whilelo p7.d, x4, x2
>>>> -       st1d    z25.d, p6, [x0]
>>>> -       incw    z30.s
>>>> -       incb    x0, all, mul #2
>>>> -       whilelo p6.d, x4, x3
>>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
>>>> +       add     z30.s, z30.s, z29.s
>>>> +       incd    x4
>>>> +       whilelo p7.d, x4, x3
>>>>      b.any   .L94
>>>> .L92:
>>>>      ret
>>>> 
>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>> -moverride=tune=none):
>>>> f_int64_t_32:
>>>>      cbz     w3, .L84
>>>> -       addvl   x5, x1, #1
>>>>      mov     x4, 0
>>>>      uxtw    x3, w3
>>>> -       mov     z31.s, w2
>>>> +       cntd    x5
>>>>      whilelo p7.d, xzr, x3
>>>> -       mov     x2, x3
>>>> -       index   z30.s, #0, #1
>>>> -       uqdecd  x2
>>>> -       ptrue   p5.b, all
>>>> -       whilelo p6.d, xzr, x2
>>>> +       mov     z29.s, w5
>>>> +       mov     z31.s, w2
>>>> +       index   z30.d, #0, #1
>>>> +       ptrue   p6.b, all
>>>>      .p2align 3,,7
>>>> .L86:
>>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
>>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
>>>> -       movprfx z29, z30
>>>> -       mul     z29.s, p5/m, z29.s, z31.s
>>>> -       add     z28.d, z28.d, #1
>>>> -       uunpklo z26.d, z29.s
>>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
>>>> -       incw    x4
>>>> -       uunpkhi z29.d, z29.s
>>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
>>>> +       movprfx z28, z30
>>>> +       mul     z28.s, p6/m, z28.s, z31.s
>>>>      add     z27.d, z27.d, #1
>>>> -       whilelo p6.d, x4, x2
>>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
>>>> -       incw    z30.s
>>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
>>>> +       incd    x4
>>>> +       add     z30.s, z30.s, z29.s
>>>>      whilelo p7.d, x4, x3
>>>>      b.any   .L86
>>>> .L84:
>>>>      ret
>>>> 
>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
>>>> regression.
>>>> OK for mainline?
>>>> 
>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>> 
>>>> gcc/
>>>>      * tree-vect-stmts.cc (vectorizable_store): Extend the use of
>>>>      n_adjacent_stores to also cover vec_to_scalar operations.
>>>>      * config/aarch64/aarch64-tuning-flags.def: Remove
>>>>      use_new_vector_costs as tuning option.
>>>>      * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>>>      Remove.
>>>>      (aarch64_vector_costs::add_stmt_cost): Remove use of
>>>>      aarch64_use_new_vector_costs_p.
>>>>      (aarch64_vector_costs::finish_cost): Remove use of
>>>>      aarch64_use_new_vector_costs_p.
>>>>      * config/aarch64/tuning_models/cortexx925.h: Remove
>>>>      AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>>>      * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>>>      * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>>>      * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>>>      * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>>>      * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>>>      * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>>>      * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>>>      * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>>>      * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>>>      * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>>> 
>>>> gcc/testsuite/
>>>>      * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>>>      * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>>> ---
>>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
>>>> gcc/config/aarch64/aarch64.cc                 | 20 +++----------
>>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>>>> gcc/tree-vect-stmts.cc                        | 29 ++++++++++---------
>>>> 16 files changed, 22 insertions(+), 44 deletions(-)
>>>> 
>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> index ffbff20e29c..1de633c739b 100644
>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
>>>> CHEAP_SHIFT_EXTEND)
>>>> 
>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>>>> 
>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
>>>> -
>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
>>>> MATCHED_VECTOR_THROUGHPUT)
>>>> 
>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
>>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>>> index 77a2a6bfa3a..71fba9cc63b 100644
>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, 
>>>> bool costing_for_scalar)
>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>>> }
>>>> 
>>>> -/* Return true if the current CPU should use the new costs defined
>>>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
>>>> -   costs applying to all CPUs instead.  */
>>>> -static bool
>>>> -aarch64_use_new_vector_costs_p ()
>>>> -{
>>>> -  return (aarch64_tune_params.extra_tuning_flags
>>>> -         & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>>> -}
>>>> -
>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
>>>> static const simd_vec_cost *
>>>> aarch64_simd_vec_costs (tree vectype)
>>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>> vect_cost_for_stmt kind,
>>>> 
>>>> /* Do one-time initialization based on the vinfo.  */
>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>>> +  if (!m_analyzed_vinfo)
>>>>   {
>>>>     if (loop_vinfo)
>>>>      analyze_loop_vinfo (loop_vinfo);
>>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>> vect_cost_for_stmt kind,
>>>> 
>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>    of just looking at KIND.  */
>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>> +  if (stmt_info)
>>>>   {
>>>>     /* If we scalarize a strided store, the vectorizer costs one
>>>>       vec_to_scalar for each element.  However, we can store the first
>>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>> vect_cost_for_stmt kind,
>>>> else
>>>>   m_num_last_promote_demote = 0;
>>>> 
>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>> +  if (stmt_info)
>>>>   {
>>>>     /* Account for any extra "embedded" costs that apply additively
>>>>       to the base cost calculated above.  */
>>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const 
>>>> vector_costs *uncast_scalar_costs)
>>>> 
>>>> auto *scalar_costs
>>>>   = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>>> -  if (loop_vinfo
>>>> -      && m_vec_flags
>>>> -      && aarch64_use_new_vector_costs_p ())
>>>> +  if (loop_vinfo && m_vec_flags)
>>>>   {
>>>>     m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>>                                           m_costs[vect_body]);
>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>> index b2ff716157a..0a8eff69307 100644
>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>> &generic_prefetch_tune,
>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>> index 2d704ecd110..a564528f43d 100644
>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
>>>> 0,   /* max_case_values.  */
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>> &generic_prefetch_tune,
>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>> index bdd309ab03d..f090d5cde50 100644
>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>> @@ -183,7 +183,6 @@ static const struct tune_params 
>>>> generic_armv8_a_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>> &generic_prefetch_tune,
>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>> index a05a9ab92a2..4c33c147444 100644
>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>> @@ -249,7 +249,6 @@ static const struct tune_params 
>>>> generic_armv9_a_tunings =
>>>> 0,   /* max_case_values.  */
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>> &generic_armv9a_prefetch_tune,
>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>> index c407b89a22f..fe4f7c10f73 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings 
>>>> =
>>>> 0,   /* max_case_values.  */
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>> &generic_prefetch_tune,
>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>> index fd5f8f37370..0c74068da2c 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>> &generic_prefetch_tune,
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>> index 8b156c2fe4d..9d4e1be171a 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>>> &generic_prefetch_tune,
>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>> index 23c121d8652..85a78bb2bef 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>> &generic_prefetch_tune,
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>> index 40af5f47f4f..1dd452beb8d 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>>>  | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),  /* tune_flags.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>> index d65d74bfecf..d0ba5b1aef6 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>> &generic_prefetch_tune,
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>> index 7b7fa0b4b08..a1572048503 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
>>>> tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>>> (AARCH64_EXTRA_TUNE_BASE
>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>>>> &generic_prefetch_tune,
>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>> index 762805ff54b..c334b7a6875 100644
>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>> @@ -15,4 +15,4 @@
>>>>  so we vectorize the offset calculation.  This means that the
>>>>  64-bit version needs two copies.  */
>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>> index f0ea58e38e2..94cc63049bc 100644
>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>> @@ -15,4 +15,4 @@
>>>>  so we vectorize the offset calculation.  This means that the
>>>>  64-bit version needs two copies.  */
>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>>>> index be1139a423c..6d7d28c4702 100644
>>>> --- a/gcc/tree-vect-stmts.cc
>>>> +++ b/gcc/tree-vect-stmts.cc
>>>> @@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo,
>>>>              {
>>>>                if (costing_p)
>>>>                  {
>>>> -                     /* Only need vector extracting when there are more
>>>> -                        than one stores.  */
>>>> -                     if (nstores > 1)
>>>> -                       inside_cost
>>>> -                         += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>>>> -                                              stmt_info, slp_node,
>>>> -                                              0, vect_body);
>>>>                    /* Take a single lane vector type store as scalar
>>>>                       store to avoid ICE like 110776.  */
>>>> -                     if (VECTOR_TYPE_P (ltype)
>>>> -                         && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>> +                     bool single_lane_vec_p =
>>>> +                       VECTOR_TYPE_P (ltype)
>>>> +                       && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U);
>>>> +                     /* Only need vector extracting when there are more
>>>> +                        than one stores.  */
>>>> +                     if (nstores > 1 || single_lane_vec_p)
>>>>                      n_adjacent_stores++;
>>>> -                     else
>>>> +                     if (!single_lane_vec_p)
>>> 
>>> I think it's somewhat non-obvious that nstores > 1 and single_lane_vec_p
>>> correlate.  In fact I think that we always record a store; it is just that
>>> for single-element vectors we record scalar stores.  I suggest here to
>>> always just do n_adjacent_stores++ and below ...
>>> 
>>>>                      inside_cost
>>>>                        += record_stmt_cost (cost_vec, 1, scalar_store,
>>>>                                             stmt_info, 0, vect_body);
>>>> @@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo,
>>>>     if (costing_p)
>>>>      {
>>>>        if (n_adjacent_stores > 0)
>>>> -           vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
>>>> -                                alignment_support_scheme, misalignment,
>>>> -                                &inside_cost, cost_vec);
>>>> +           {
>>>> +             vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
>>>> +                                  alignment_support_scheme, misalignment,
>>>> +                                  &inside_cost, cost_vec);
>>> 
>>> ... record n_adjacent_stores scalar_store when ltype is single-lane and
>>> record n_adjacent_stores vec_to_scalar if nstores > 1 (and else none).
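>>> 
>>> In pseudo-code, something like this (an untested sketch; the exact form
>>> of the single-lane test is illustrative):
>>> 
>>>       /* In the per-store loop: only count while costing.  */
>>>       if (costing_p)
>>>         {
>>>           n_adjacent_stores++;
>>>           continue;
>>>         }
>>>       ...
>>>       /* After the loop: cost the group as a whole.  */
>>>       if (n_adjacent_stores > 0)
>>>         {
>>>           if (!VECTOR_TYPE_P (ltype)
>>>               || known_eq (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>             inside_cost
>>>               += record_stmt_cost (cost_vec, n_adjacent_stores, scalar_store,
>>>                                    stmt_info, 0, vect_body);
>>>           else
>>>             vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
>>>                                  alignment_support_scheme, misalignment,
>>>                                  &inside_cost, cost_vec);
>>>           if (nstores > 1)
>>>             inside_cost
>>>               += record_stmt_cost (cost_vec, n_adjacent_stores, vec_to_scalar,
>>>                                    stmt_info, slp_node, 0, vect_body);
>>>         }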
>>> 
>>> Richard.
>> Thanks for the feedback; I’m glad it’s going in the right direction.  Below
>> is the updated patch, re-validated on aarch64.
>> Thanks, Jennifer
>> 
>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
>> the use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>> default.  To that end, the function aarch64_use_new_vector_costs_p and
>> its uses were removed.  To prevent costing vec_to_scalar operations at 0,
>> as described in
>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>> we adjusted vectorizable_store such that the variable n_adjacent_stores
>> also covers vec_to_scalar operations.  This way, vec_to_scalar operations
>> are not costed individually, but as a group.
>> 
>> Two tests were adjusted due to changes in codegen. In both cases, the
>> old code performed loop unrolling once, but the new code does not:
>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>> -moverride=tune=none):
>> f_int64_t_32:
>>       cbz     w3, .L92
>>       mov     x4, 0
>>       uxtw    x3, w3
>> +       cntd    x5
>> +       whilelo p7.d, xzr, x3
>> +       mov     z29.s, w5
>>       mov     z31.s, w2
>> -       whilelo p6.d, xzr, x3
>> -       mov     x2, x3
>> -       index   z30.s, #0, #1
>> -       uqdecd  x2
>> -       ptrue   p5.b, all
>> -       whilelo p7.d, xzr, x2
>> +       index   z30.d, #0, #1
>> +       ptrue   p6.b, all
>>       .p2align 3,,7
>> .L94:
>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
>> -       ld1d    z28.d, p6/z, [x0]
>> -       movprfx z29, z31
>> -       mul     z29.s, p5/m, z29.s, z30.s
>> -       incw    x4
>> -       uunpklo z0.d, z29.s
>> -       uunpkhi z29.d, z29.s
>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
>> -       add     z25.d, z28.d, z25.d
>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
>> +       movprfx z28, z31
>> +       mul     z28.s, p6/m, z28.s, z30.s
>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>>       add     z26.d, z27.d, z26.d
>> -       st1d    z26.d, p7, [x0, #1, mul vl]
>> -       whilelo p7.d, x4, x2
>> -       st1d    z25.d, p6, [x0]
>> -       incw    z30.s
>> -       incb    x0, all, mul #2
>> -       whilelo p6.d, x4, x3
>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
>> +       add     z30.s, z30.s, z29.s
>> +       incd    x4
>> +       whilelo p7.d, x4, x3
>>       b.any   .L94
>> .L92:
>>       ret
>> 
>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>> -moverride=tune=none):
>> f_int64_t_32:
>>       cbz     w3, .L84
>> -       addvl   x5, x1, #1
>>       mov     x4, 0
>>       uxtw    x3, w3
>> -       mov     z31.s, w2
>> +       cntd    x5
>>       whilelo p7.d, xzr, x3
>> -       mov     x2, x3
>> -       index   z30.s, #0, #1
>> -       uqdecd  x2
>> -       ptrue   p5.b, all
>> -       whilelo p6.d, xzr, x2
>> +       mov     z29.s, w5
>> +       mov     z31.s, w2
>> +       index   z30.d, #0, #1
>> +       ptrue   p6.b, all
>>       .p2align 3,,7
>> .L86:
>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
>> -       movprfx z29, z30
>> -       mul     z29.s, p5/m, z29.s, z31.s
>> -       add     z28.d, z28.d, #1
>> -       uunpklo z26.d, z29.s
>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
>> -       incw    x4
>> -       uunpkhi z29.d, z29.s
>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
>> +       movprfx z28, z30
>> +       mul     z28.s, p6/m, z28.s, z31.s
>>       add     z27.d, z27.d, #1
>> -       whilelo p6.d, x4, x2
>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
>> -       incw    z30.s
>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
>> +       incd    x4
>> +       add     z30.s, z30.s, z29.s
>>       whilelo p7.d, x4, x3
>>       b.any   .L86
>> .L84:
>>   ret
>> 
>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
>> regression.
>> OK for mainline?
>> 
>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>> 
>> gcc/
>>   * tree-vect-stmts.cc (vectorizable_store): Extend the use of
>>   n_adjacent_stores to also cover vec_to_scalar operations.
>>   * config/aarch64/aarch64-tuning-flags.def: Remove
>>   use_new_vector_costs as tuning option.
>>   * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>   Remove.
>>   (aarch64_vector_costs::add_stmt_cost): Remove use of
>>   aarch64_use_new_vector_costs_p.
>>   (aarch64_vector_costs::finish_cost): Remove use of
>>   aarch64_use_new_vector_costs_p.
>>   * config/aarch64/tuning_models/cortexx925.h: Remove
>>   AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>   * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>   * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>   * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>   * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>   * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>   * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>   * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>   * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>   * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>   * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>> 
>> gcc/testsuite/
>>   * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>   * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>> ---
>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 -
>> gcc/config/aarch64/aarch64.cc                 | 20 ++--------
>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>> gcc/tree-vect-stmts.cc                        | 37 +++++++++++--------
>> 16 files changed, 27 insertions(+), 47 deletions(-)
>> 
>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>> index ffbff20e29c..1de633c739b 100644
>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
>> CHEAP_SHIFT_EXTEND)
>> 
>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>> 
>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
>> -
>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
>> MATCHED_VECTOR_THROUGHPUT)
>> 
>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> index 77a2a6bfa3a..71fba9cc63b 100644
>> --- a/gcc/config/aarch64/aarch64.cc
>> +++ b/gcc/config/aarch64/aarch64.cc
>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, 
>> bool costing_for_scalar)
>>  return new aarch64_vector_costs (vinfo, costing_for_scalar);
>> }
>> 
>> -/* Return true if the current CPU should use the new costs defined
>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
>> -   costs applying to all CPUs instead.  */
>> -static bool
>> -aarch64_use_new_vector_costs_p ()
>> -{
>> -  return (aarch64_tune_params.extra_tuning_flags
>> -      & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>> -}
>> -
>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
>> static const simd_vec_cost *
>> aarch64_simd_vec_costs (tree vectype)
>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>> vect_cost_for_stmt kind,
>> 
>>  /* Do one-time initialization based on the vinfo.  */
>>  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>> +  if (!m_analyzed_vinfo)
>>    {
>>      if (loop_vinfo)
>>   analyze_loop_vinfo (loop_vinfo);
>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>> vect_cost_for_stmt kind,
>> 
>>  /* Try to get a more accurate cost by looking at STMT_INFO instead
>>     of just looking at KIND.  */
>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>> +  if (stmt_info)
>>    {
>>      /* If we scalarize a strided store, the vectorizer costs one
>>    vec_to_scalar for each element.  However, we can store the first
>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>> vect_cost_for_stmt kind,
>>  else
>>    m_num_last_promote_demote = 0;
>> 
>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>> +  if (stmt_info)
>>    {
>>      /* Account for any extra "embedded" costs that apply additively
>>    to the base cost calculated above.  */
>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const 
>> vector_costs *uncast_scalar_costs)
>> 
>>  auto *scalar_costs
>>    = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>> -  if (loop_vinfo
>> -      && m_vec_flags
>> -      && aarch64_use_new_vector_costs_p ())
>> +  if (loop_vinfo && m_vec_flags)
>>    {
>>      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>                        m_costs[vect_body]);
>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>> index 5ebaf66e986..74772f3e15f 100644
>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>> @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings =
>>  tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>  (AARCH64_EXTRA_TUNE_BASE
>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>  &generic_armv9a_prefetch_tune,
>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>> index 2d704ecd110..a564528f43d 100644
>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
>>  0,    /* max_case_values.  */
>>  tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>  (AARCH64_EXTRA_TUNE_BASE
>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>  &generic_prefetch_tune,
>>  AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>> index bdd309ab03d..f090d5cde50 100644
>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>> @@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings =
>>  tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>  (AARCH64_EXTRA_TUNE_BASE
>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>  &generic_prefetch_tune,
>>  AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>> index 785e00946bc..7b5821183bc 100644
>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>> @@ -251,7 +251,6 @@ static const struct tune_params generic_armv9_a_tunings =
>>  0,    /* max_case_values.  */
>>  tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>  (AARCH64_EXTRA_TUNE_BASE
>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>  &generic_armv9a_prefetch_tune,
>>  AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>> index 007f987154c..f7457df59e5 100644
>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings =
>>  0,    /* max_case_values.  */
>>  tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>  (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>  &generic_armv9a_prefetch_tune,
>>  AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>> index 32560d2f5f8..541b61c8179 100644
>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
>>  tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>  (AARCH64_EXTRA_TUNE_BASE
>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>  &generic_armv9a_prefetch_tune,
>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>> index 2010bc4645b..eff668132a8 100644
>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
>>  tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>  (AARCH64_EXTRA_TUNE_BASE
>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>>  &generic_armv9a_prefetch_tune,
>>  AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>> index c3751e32696..d11472b6e1e 100644
>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
>>  tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>  (AARCH64_EXTRA_TUNE_BASE
>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>  &generic_armv9a_prefetch_tune,
>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>> index 80dbe5c806c..ee77ffdd3bc 100644
>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings =
>>  tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>  (AARCH64_EXTRA_TUNE_BASE
>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>   | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),    /* tune_flags.  */
>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>> index efe09e16d1e..6ef143ef7d5 100644
>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
>>  tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>  (AARCH64_EXTRA_TUNE_BASE
>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>  &generic_armv9a_prefetch_tune,
>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>> index 66849f30889..96bdbf971f1 100644
>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
>>  tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>>  (AARCH64_EXTRA_TUNE_BASE
>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),    /* tune_flags.  */
>>  &generic_armv9a_prefetch_tune,
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>> index 762805ff54b..c334b7a6875 100644
>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>> @@ -15,4 +15,4 @@
>>   so we vectorize the offset calculation.  This means that the
>>   64-bit version needs two copies.  */
>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>> index f0ea58e38e2..94cc63049bc 100644
>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>> @@ -15,4 +15,4 @@
>>   so we vectorize the offset calculation.  This means that the
>>   64-bit version needs two copies.  */
>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> index be1139a423c..ab57163c243 100644
>> --- a/gcc/tree-vect-stmts.cc
>> +++ b/gcc/tree-vect-stmts.cc
>> @@ -8834,19 +8834,8 @@ vectorizable_store (vec_info *vinfo,
>>       {
>>         if (costing_p)
>>           {
>> -              /* Only need vector extracting when there are more
>> -             than one stores.  */
>> -              if (nstores > 1)
>> -            inside_cost
>> -              += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>> -                           stmt_info, slp_node,
>> -                           0, vect_body);
>> -              /* Take a single lane vector type store as scalar
>> -             store to avoid ICE like 110776.  */
>> -              if (VECTOR_TYPE_P (ltype)
>> -              && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>> -            n_adjacent_stores++;
>> -              else
>> +              n_adjacent_stores++;
>> +              if (!VECTOR_TYPE_P (ltype))
> 
> This should be combined with the single-lane vector case below.
> 
>>           inside_cost
>>             += record_stmt_cost (cost_vec, 1, scalar_store,
>>                          stmt_info, 0, vect_body);
>> @@ -8905,9 +8894,25 @@ vectorizable_store (vec_info *vinfo,
>>      if (costing_p)
>>   {
>>     if (n_adjacent_stores > 0)
>> -        vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
>> -                 alignment_support_scheme, misalignment,
>> -                 &inside_cost, cost_vec);
>> +        {
>> +          /* Take a single lane vector type store as scalar
>> +         store to avoid ICE like 110776.  */
>> +          if (VECTOR_TYPE_P (ltype)
>> +          && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>> +        inside_cost
>> +          += record_stmt_cost (cost_vec, n_adjacent_stores,
>> +                       scalar_store, stmt_info, 0, vect_body);
>> +          /* Only need vector extracting when there are more
>> +         than one stores.  */
>> +          if (nstores > 1)
>> +        inside_cost
>> +          += record_stmt_cost (cost_vec, n_adjacent_stores,
>> +                       vec_to_scalar, stmt_info, slp_node,
>> +                       0, vect_body);
>> +          vect_get_store_cost (vinfo, stmt_info, slp_node,
> 
> This should only be done for multi-lane vectors.
Thanks for the quick reply. As I am making the changes, I am wondering: Do we 
even need n_adjacent_stores anymore? It appears to always have the same value 
as nstores. Can we remove it and use nstores instead or does it still serve 
another purpose?
Thanks, Jennifer
> 
>> +                   n_adjacent_stores, alignment_support_scheme,
>> +                   misalignment, &inside_cost, cost_vec);
>> +        }
>>     if (dump_enabled_p ())
>>       dump_printf_loc (MSG_NOTE, vect_location,
>>                "vect_model_store_cost: inside_cost = %d, "
>> --
>> 2.34.1
>>> 
>>>> +             inside_cost
>>>> +               += record_stmt_cost (cost_vec, n_adjacent_stores, vec_to_scalar,
>>>> +                                    stmt_info, slp_node,
>>>> +                                    0, vect_body);
>>>> +           }
>>>>        if (dump_enabled_p ())
>>>>          dump_printf_loc (MSG_NOTE, vect_location,
>>>>                           "vect_model_store_cost: inside_cost = %d, "
>>>> --
>>>> 2.44.0
>>>> 
>>>> 
>>>>>> 
>>>>>> Richard
>>>>>> 
>>>>>>> Thanks,
>>>>>>> Jennifer
>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Jennifer
>>>>>>>>> 
>>>>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS 
>>>>>>>>> tunable and
>>>>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>>>>>>> default. To that end, the function aarch64_use_new_vector_costs_p and 
>>>>>>>>> its uses
>>>>>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as
>>>>>>>>> described in
>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>>>>>>>> we guarded the call to vect_is_store_elt_extraction in
>>>>>>>>> aarch64_vector_costs::add_stmt_cost by count > 1.
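>>>>>>>>> I.e., roughly (a sketch of the guard):
>>>>>>>>> 
>>>>>>>>>   if (count > 1 && vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>>>>     count -= 1;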
>>>>>>>>> 
>>>>>>>>> Two tests were adjusted due to changes in codegen. In both cases, the
>>>>>>>>> old code performed loop unrolling once, but the new code does not:
>>>>>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>>>>>>> -moverride=tune=none):
>>>>>>>>> f_int64_t_32:
>>>>>>>>>   cbz     w3, .L92
>>>>>>>>>   mov     x4, 0
>>>>>>>>>   uxtw    x3, w3
>>>>>>>>> +       cntd    x5
>>>>>>>>> +       whilelo p7.d, xzr, x3
>>>>>>>>> +       mov     z29.s, w5
>>>>>>>>>   mov     z31.s, w2
>>>>>>>>> -       whilelo p6.d, xzr, x3
>>>>>>>>> -       mov     x2, x3
>>>>>>>>> -       index   z30.s, #0, #1
>>>>>>>>> -       uqdecd  x2
>>>>>>>>> -       ptrue   p5.b, all
>>>>>>>>> -       whilelo p7.d, xzr, x2
>>>>>>>>> +       index   z30.d, #0, #1
>>>>>>>>> +       ptrue   p6.b, all
>>>>>>>>>   .p2align 3,,7
>>>>>>>>> .L94:
>>>>>>>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
>>>>>>>>> -       ld1d    z28.d, p6/z, [x0]
>>>>>>>>> -       movprfx z29, z31
>>>>>>>>> -       mul     z29.s, p5/m, z29.s, z30.s
>>>>>>>>> -       incw    x4
>>>>>>>>> -       uunpklo z0.d, z29.s
>>>>>>>>> -       uunpkhi z29.d, z29.s
>>>>>>>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
>>>>>>>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
>>>>>>>>> -       add     z25.d, z28.d, z25.d
>>>>>>>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
>>>>>>>>> +       movprfx z28, z31
>>>>>>>>> +       mul     z28.s, p6/m, z28.s, z30.s
>>>>>>>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>>>>>>>>>   add     z26.d, z27.d, z26.d
>>>>>>>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
>>>>>>>>> -       whilelo p7.d, x4, x2
>>>>>>>>> -       st1d    z25.d, p6, [x0]
>>>>>>>>> -       incw    z30.s
>>>>>>>>> -       incb    x0, all, mul #2
>>>>>>>>> -       whilelo p6.d, x4, x3
>>>>>>>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
>>>>>>>>> +       add     z30.s, z30.s, z29.s
>>>>>>>>> +       incd    x4
>>>>>>>>> +       whilelo p7.d, x4, x3
>>>>>>>>>   b.any   .L94
>>>>>>>>> .L92:
>>>>>>>>>   ret
>>>>>>>>> 
>>>>>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>>>>>>> -moverride=tune=none):
>>>>>>>>> f_int64_t_32:
>>>>>>>>>   cbz     w3, .L84
>>>>>>>>> -       addvl   x5, x1, #1
>>>>>>>>>   mov     x4, 0
>>>>>>>>>   uxtw    x3, w3
>>>>>>>>> -       mov     z31.s, w2
>>>>>>>>> +       cntd    x5
>>>>>>>>>   whilelo p7.d, xzr, x3
>>>>>>>>> -       mov     x2, x3
>>>>>>>>> -       index   z30.s, #0, #1
>>>>>>>>> -       uqdecd  x2
>>>>>>>>> -       ptrue   p5.b, all
>>>>>>>>> -       whilelo p6.d, xzr, x2
>>>>>>>>> +       mov     z29.s, w5
>>>>>>>>> +       mov     z31.s, w2
>>>>>>>>> +       index   z30.d, #0, #1
>>>>>>>>> +       ptrue   p6.b, all
>>>>>>>>>   .p2align 3,,7
>>>>>>>>> .L86:
>>>>>>>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
>>>>>>>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
>>>>>>>>> -       movprfx z29, z30
>>>>>>>>> -       mul     z29.s, p5/m, z29.s, z31.s
>>>>>>>>> -       add     z28.d, z28.d, #1
>>>>>>>>> -       uunpklo z26.d, z29.s
>>>>>>>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
>>>>>>>>> -       incw    x4
>>>>>>>>> -       uunpkhi z29.d, z29.s
>>>>>>>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
>>>>>>>>> +       movprfx z28, z30
>>>>>>>>> +       mul     z28.s, p6/m, z28.s, z31.s
>>>>>>>>>   add     z27.d, z27.d, #1
>>>>>>>>> -       whilelo p6.d, x4, x2
>>>>>>>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
>>>>>>>>> -       incw    z30.s
>>>>>>>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
>>>>>>>>> +       incd    x4
>>>>>>>>> +       add     z30.s, z30.s, z29.s
>>>>>>>>>   whilelo p7.d, x4, x3
>>>>>>>>>   b.any   .L86
>>>>>>>>> .L84:
>>>>>>>>>   ret
>>>>>>>>> 
>>>>>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu with no
>>>>>>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace machine
>>>>>>>>> and saw no performance changes beyond noise. We would appreciate help
>>>>>>>>> with wider benchmarking on other platforms, if necessary.
>>>>>>>>> OK for mainline?
>>>>>>>>> 
>>>>>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>>>>>>> 
>>>>>>>>> gcc/
>>>>>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove
>>>>>>>>> use_new_vector_costs as tuning option.
>>>>>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>>>>>>>> Remove.
>>>>>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of
>>>>>>>>> aarch64_use_new_vector_costs_p and guard call to
>>>>>>>>> vect_is_store_elt_extraction with count > 1.
>>>>>>>>> (aarch64_vector_costs::finish_cost): Remove use of
>>>>>>>>> aarch64_use_new_vector_costs_p.
>>>>>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove
>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>>>>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>>>>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>>>>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>>>>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>>>>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>>>>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>>>>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>>>>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>>>>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>>>>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>>>>>>>> 
>>>>>>>>> gcc/testsuite/
>>>>>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>>>>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>>>>>>>> ---
>>>>>>>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
>>>>>>>>> gcc/config/aarch64/aarch64.cc                 | 22 +++++--------------
>>>>>>>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>>>>>>>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>>>>>>>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>>>>>>>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>>>>>>>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>>>>>>>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>>>>>>>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>>>>>>>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>>>>>>>>> 15 files changed, 7 insertions(+), 32 deletions(-)
>>>>>>>>> 
>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>> index 5939602576b..ed345b13ed3 100644
>>>>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
>>>>>>>>> CHEAP_SHIFT_EXTEND)
>>>>>>>>> 
>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", 
>>>>>>>>> CSE_SVE_VL_CONSTANTS)
>>>>>>>>> 
>>>>>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", 
>>>>>>>>> USE_NEW_VECTOR_COSTS)
>>>>>>>>> -
>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
>>>>>>>>> MATCHED_VECTOR_THROUGHPUT)
>>>>>>>>> 
>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
>>>>>>>>> AVOID_CROSS_LOOP_FMA)
>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64.cc 
>>>>>>>>> b/gcc/config/aarch64/aarch64.cc
>>>>>>>>> index 43238aefef2..03806671c97 100644
>>>>>>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>>>>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>>>>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info 
>>>>>>>>> *vinfo, bool costing_for_scalar)
>>>>>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> -/* Return true if the current CPU should use the new costs defined
>>>>>>>>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
>>>>>>>>> -   costs applying to all CPUs instead.  */
>>>>>>>>> -static bool
>>>>>>>>> -aarch64_use_new_vector_costs_p ()
>>>>>>>>> -{
>>>>>>>>> -  return (aarch64_tune_params.extra_tuning_flags
>>>>>>>>> -       & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>>>>>>>> -}
>>>>>>>>> -
>>>>>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
>>>>>>>>> static const simd_vec_cost *
>>>>>>>>> aarch64_simd_vec_costs (tree vectype)
>>>>>>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int 
>>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>> 
>>>>>>>>> /* Do one-time initialization based on the vinfo.  */
>>>>>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>>>>>>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>>>>>>>> +  if (!m_analyzed_vinfo)
>>>>>>>>> {
>>>>>>>>>  if (loop_vinfo)
>>>>>>>>> analyze_loop_vinfo (loop_vinfo);
>>>>>>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost (int 
>>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>> 
>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>>>> of just looking at KIND.  */
>>>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>>> +  if (stmt_info)
>>>>>>>>> {
>>>>>>>>>  /* If we scalarize a strided store, the vectorizer costs one
>>>>>>>>>  vec_to_scalar for each element.  However, we can store the first
>>>>>>>>>  element using an FP store without a separate extract step.  */
>>>>>>>>> -      if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>>>> +      if (vect_is_store_elt_extraction (kind, stmt_info) && count > 
>>>>>>>>> 1)
>>>>>>>>> count -= 1;
>>>>>>>>> 
>>>>>>>>>  stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>>>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int 
>>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>> else
>>>>>>>>> m_num_last_promote_demote = 0;
>>>>>>>>> 
>>>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>>> +  if (stmt_info)
>>>>>>>>> {
>>>>>>>>>  /* Account for any extra "embedded" costs that apply additively
>>>>>>>>>  to the base cost calculated above.  */
>>>>>>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const 
>>>>>>>>> vector_costs *uncast_scalar_costs)
>>>>>>>>> 
>>>>>>>>> auto *scalar_costs
>>>>>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>>>>>>>> -  if (loop_vinfo
>>>>>>>>> -      && m_vec_flags
>>>>>>>>> -      && aarch64_use_new_vector_costs_p ())
>>>>>>>>> +  if (loop_vinfo && m_vec_flags)
>>>>>>>>> {
>>>>>>>>>  m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>>>>>>>                                      m_costs[vect_body]);
>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
>>>>>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>> index eb9b89984b0..dafea96e924 100644
>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>> cortexx925_tunings =
>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
>>>>>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>> index 6a098497759..ac001927959 100644
>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>> @@ -55,7 +55,6 @@ static const struct tune_params 
>>>>>>>>> fujitsu_monaka_tunings =
>>>>>>>>> 0, /* max_case_values.  */
>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>> index 9b1cbfc5bd2..7b534831340 100644
>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>> @@ -183,7 +183,6 @@ static const struct tune_params 
>>>>>>>>> generic_armv8_a_tunings =
>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>> index 48353a59939..562ef89c67b 100644
>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>> @@ -249,7 +249,6 @@ static const struct tune_params 
>>>>>>>>> generic_armv9_a_tunings =
>>>>>>>>> 0, /* max_case_values.  */
>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>> &generic_armv9a_prefetch_tune,
>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>> index c407b89a22f..fe4f7c10f73 100644
>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>> @@ -156,7 +156,6 @@ static const struct tune_params 
>>>>>>>>> neoverse512tvb_tunings =
>>>>>>>>> 0, /* max_case_values.  */
>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>> index 18199ac206c..56be77423cb 100644
>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>> neoversen2_tunings =
>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>> index 4da85cfac0d..254ad5e27f8 100644
>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>> neoversen3_tunings =
>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>> index dd9120eee48..c7241cf23d7 100644
>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>> @@ -227,7 +227,6 @@ static const struct tune_params 
>>>>>>>>> neoversev1_tunings =
>>>>>>>>> 0, /* max_case_values.  */
>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>> | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>> index 1369de73991..96f55940649 100644
>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>> @@ -232,7 +232,6 @@ static const struct tune_params 
>>>>>>>>> neoversev2_tunings =
>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>>>>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),        /* tune_flags.  */
>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>> index d8c82255378..f62ae67d355 100644
>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>> neoversev3_tunings =
>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>> index 7f050501ede..0233baf5e34 100644
>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
>>>>>>>>> neoversev3ae_tunings =
>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>> index 762805ff54b..c334b7a6875 100644
>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>> @@ -15,4 +15,4 @@
>>>>>>>>> so we vectorize the offset calculation.  This means that the
>>>>>>>>> 64-bit version needs two copies.  */
>>>>>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>> index f0ea58e38e2..94cc63049bc 100644
>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>> @@ -15,4 +15,4 @@
>>>>>>>>> so we vectorize the offset calculation.  This means that the
>>>>>>>>> 64-bit version needs two copies.  */
>>>>>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Richard Biener <rguent...@suse.de>
>>>>>>>> SUSE Software Solutions Germany GmbH,
>>>>>>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>>>>>>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG 
>>>>>>>> Nuernberg)

