On Thu, Dec 12, 2024 at 5:27 PM Jennifer Schmitz <jschm...@nvidia.com> wrote:
>
>
>
> > On 6 Dec 2024, at 08:41, Jennifer Schmitz <jschm...@nvidia.com> wrote:
> >
> >
> >
> >> On 5 Dec 2024, at 20:07, Richard Sandiford <richard.sandif...@arm.com> 
> >> wrote:
> >>
> >> Jennifer Schmitz <jschm...@nvidia.com> writes:
> >>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote:
> >>>>
> >>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford 
> >>>>>> <richard.sandif...@arm.com> wrote:
> >>>>>>
> >>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
> >>>>>>> [...]
> >>>>>>> Looking at the diff of the vect dumps (below is a section of the diff 
> >>>>>>> for strided_store_2.c), it seemed odd that vec_to_scalar operations 
> >>>>>>> cost 0 now, instead of the previous cost of 2:
> >>>>>>>
> >>>>>>> +strided_store_1.c:38:151: note:    === vectorizable_operation ===
> >>>>>>> +strided_store_1.c:38:151: note:    vect_model_simple_cost: 
> >>>>>>> inside_cost = 1, prologue_cost  = 0 .
> >>>>>>> +strided_store_1.c:38:151: note:   ==> examining statement: *_6 = _7;
> >>>>>>> +strided_store_1.c:38:151: note:   vect_is_simple_use: operand _3 + 
> >>>>>>> 1.0e+0, type of def:    internal
> >>>>>>> +strided_store_1.c:38:151: note:   Vectorizing an unaligned access.
> >>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128
> >>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
> >>>>>>> +strided_store_1.c:38:151: note:   vect_model_store_cost: inside_cost 
> >>>>>>> = 12, prologue_cost = 0 .
> >>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body
> >>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
> >>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
> >>>>>>> -_7 1 times vec_to_scalar costs 2 in body
> >>>>>>> +<unknown> 1 times vector_load costs 1 in prologue
> >>>>>>> +_7 1 times vec_to_scalar costs 0 in body
> >>>>>>> _7 1 times scalar_store costs 1 in body
> >>>>>>> -_7 1 times vec_to_scalar costs 2 in body
> >>>>>>> +_7 1 times vec_to_scalar costs 0 in body
> >>>>>>> _7 1 times scalar_store costs 1 in body
> >>>>>>> -_7 1 times vec_to_scalar costs 2 in body
> >>>>>>> +_7 1 times vec_to_scalar costs 0 in body
> >>>>>>> _7 1 times scalar_store costs 1 in body
> >>>>>>> -_7 1 times vec_to_scalar costs 2 in body
> >>>>>>> +_7 1 times vec_to_scalar costs 0 in body
> >>>>>>> _7 1 times scalar_store costs 1 in body
> >>>>>>>
> >>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in multiple 
> >>>>>>> places in aarch64.cc, the location that causes this behavior is this 
> >>>>>>> one:
> >>>>>>> unsigned
> >>>>>>> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt 
> >>>>>>> kind,
> >>>>>>>                                  stmt_vec_info stmt_info, slp_tree,
> >>>>>>>                                  tree vectype, int misalign,
> >>>>>>>                                  vect_cost_model_location where)
> >>>>>>> {
> >>>>>>> [...]
> >>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
> >>>>>>>  of just looking at KIND.  */
> >>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>>>> +  if (stmt_info)
> >>>>>>> {
> >>>>>>>   /* If we scalarize a strided store, the vectorizer costs one
> >>>>>>>      vec_to_scalar for each element.  However, we can store the first
> >>>>>>>      element using an FP store without a separate extract step.  */
> >>>>>>>   if (vect_is_store_elt_extraction (kind, stmt_info))
> >>>>>>>     count -= 1;
> >>>>>>>
> >>>>>>>   stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
> >>>>>>>                                                   stmt_info, 
> >>>>>>> stmt_cost);
> >>>>>>>
> >>>>>>>   if (vectype && m_vec_flags)
> >>>>>>>     stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
> >>>>>>>                                                     stmt_info, 
> >>>>>>> vectype,
> >>>>>>>                                                     where, stmt_cost);
> >>>>>>> }
> >>>>>>> [...]
> >>>>>>> return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil 
> >>>>>>> ());
> >>>>>>> }
> >>>>>>>
> >>>>>>> Previously, for mtune=generic, this function returned a cost of 2 for 
> >>>>>>> a vec_to_scalar operation in the vect body. Now "if (stmt_info)" is 
> >>>>>>> entered and "if (vect_is_store_elt_extraction (kind, stmt_info))" 
> >>>>>>> evaluates to true, which sets the count to 0 and leads to a return 
> >>>>>>> value of 0.
> >>>>>>
> >>>>>> At the time the code was written, a scalarised store would be costed
> >>>>>> using one vec_to_scalar call into the backend, with the count parameter
> >>>>>> set to the number of elements being stored.  The "count -= 1" was
> >>>>>> supposed to lop off the leading element extraction, since we can store
> >>>>>> lane 0 as a normal FP store.
> >>>>>>
> >>>>>> The target-independent costing was later reworked so that it costs
> >>>>>> each operation individually:
> >>>>>>
> >>>>>>           for (i = 0; i < nstores; i++)
> >>>>>>             {
> >>>>>>               if (costing_p)
> >>>>>>                 {
> >>>>>>                   /* Only need vector extracting when there are more
> >>>>>>                      than one stores.  */
> >>>>>>                   if (nstores > 1)
> >>>>>>                     inside_cost
> >>>>>>                       += record_stmt_cost (cost_vec, 1, vec_to_scalar,
> >>>>>>                                            stmt_info, 0, vect_body);
> >>>>>>                   /* Take a single lane vector type store as scalar
> >>>>>>                      store to avoid ICE like 110776.  */
> >>>>>>                   if (VECTOR_TYPE_P (ltype)
> >>>>>>                       && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> >>>>>>                     n_adjacent_stores++;
> >>>>>>                   else
> >>>>>>                     inside_cost
> >>>>>>                       += record_stmt_cost (cost_vec, 1, scalar_store,
> >>>>>>                                            stmt_info, 0, vect_body);
> >>>>>>                   continue;
> >>>>>>                 }
> >>>>>>
> >>>>>> Unfortunately, there's no easy way of telling whether a particular call
> >>>>>> is part of a group, and if so, which member of the group it is.
> >>>>>>
> >>>>>> I suppose we could give up on the attempt to be (somewhat) accurate
> >>>>>> and just disable the optimisation.  Or we could restrict it to
> >>>>>> count > 1, since it might still be useful for gathers and scatters.
> >>>>> I tried restricting the calls to vect_is_store_elt_extraction to
> >>>>> count > 1 and it seems to resolve the issue of costing vec_to_scalar
> >>>>> operations with 0 (see patch below).
> >>>>> What are your thoughts on this?
> >>>>
> >>>> Why didn't you pursue instead moving the vec_to_scalar cost together
> >>>> with the n_adjacent_store handling?
> >>> When I continued working on this patch, we had already reached stage 3 
> >>> and I was hesitant to introduce changes to the middle-end that were not 
> >>> previously covered by this patch. So I tried to see whether the issue
> >>> could be resolved by making a small change in the backend.
> >>> If you still advise using n_adjacent_stores instead, I’m happy to
> >>> look into it again.
> >>
> >> If Richard's ok with adjusting vectorizable_store for GCC 15 (which it
> >> sounds like he is), then I agree that would be better.  Otherwise we'd
> >> be creating technical debt to clean up for GCC 16.  And it is a regression
> >> of sorts, so is stage 3 material from that POV.
> >>
> >> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a
> >> "let's clean this up next stage 1" thing, since we needed to add tuning
> >> for a new CPU late during the cycle.  But of course, there were other
> >> priorities when stage 1 actually came around, so it never actually
> >> happened.  Thanks again for being the one to sort this out.)
> > Thanks for your feedback. Then I will try to make it work in 
> > vectorizable_store.
> > Best,
> > Jennifer
> Below is the updated patch with a suggestion for the changes in
> vectorizable_store. It resolves the issue of the vec_to_scalar operations
> that were individually costed at 0.
> We have already tested it on aarch64 with no regressions, but we are still
> doing performance testing.
> Could you give some feedback on the patch itself in the meantime?
> Thanks,
> Jennifer
>
>
> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
> the use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
> default. To that end, the function aarch64_use_new_vector_costs_p and its
> uses are removed. To prevent costing vec_to_scalar operations at 0, as
> described in
> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
> we adjust vectorizable_store such that the variable n_adjacent_stores
> also covers vec_to_scalar operations. This way, vec_to_scalar operations
> are costed as a group rather than individually.
>
> Two tests were adjusted due to changes in codegen. In both cases, the
> old code unrolled the loop once, while the new code does not:
> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
> -moverride=tune=none):
> f_int64_t_32:
>         cbz     w3, .L92
>         mov     x4, 0
>         uxtw    x3, w3
> +       cntd    x5
> +       whilelo p7.d, xzr, x3
> +       mov     z29.s, w5
>         mov     z31.s, w2
> -       whilelo p6.d, xzr, x3
> -       mov     x2, x3
> -       index   z30.s, #0, #1
> -       uqdecd  x2
> -       ptrue   p5.b, all
> -       whilelo p7.d, xzr, x2
> +       index   z30.d, #0, #1
> +       ptrue   p6.b, all
>         .p2align 3,,7
>  .L94:
> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
> -       ld1d    z28.d, p6/z, [x0]
> -       movprfx z29, z31
> -       mul     z29.s, p5/m, z29.s, z30.s
> -       incw    x4
> -       uunpklo z0.d, z29.s
> -       uunpkhi z29.d, z29.s
> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
> -       add     z25.d, z28.d, z25.d
> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
> +       movprfx z28, z31
> +       mul     z28.s, p6/m, z28.s, z30.s
> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>         add     z26.d, z27.d, z26.d
> -       st1d    z26.d, p7, [x0, #1, mul vl]
> -       whilelo p7.d, x4, x2
> -       st1d    z25.d, p6, [x0]
> -       incw    z30.s
> -       incb    x0, all, mul #2
> -       whilelo p6.d, x4, x3
> +       st1d    z26.d, p7, [x0, x4, lsl 3]
> +       add     z30.s, z30.s, z29.s
> +       incd    x4
> +       whilelo p7.d, x4, x3
>         b.any   .L94
>  .L92:
>         ret
>
> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
> -moverride=tune=none):
> f_int64_t_32:
>         cbz     w3, .L84
> -       addvl   x5, x1, #1
>         mov     x4, 0
>         uxtw    x3, w3
> -       mov     z31.s, w2
> +       cntd    x5
>         whilelo p7.d, xzr, x3
> -       mov     x2, x3
> -       index   z30.s, #0, #1
> -       uqdecd  x2
> -       ptrue   p5.b, all
> -       whilelo p6.d, xzr, x2
> +       mov     z29.s, w5
> +       mov     z31.s, w2
> +       index   z30.d, #0, #1
> +       ptrue   p6.b, all
>         .p2align 3,,7
>  .L86:
> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
> -       movprfx z29, z30
> -       mul     z29.s, p5/m, z29.s, z31.s
> -       add     z28.d, z28.d, #1
> -       uunpklo z26.d, z29.s
> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
> -       incw    x4
> -       uunpkhi z29.d, z29.s
> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
> +       movprfx z28, z30
> +       mul     z28.s, p6/m, z28.s, z31.s
>         add     z27.d, z27.d, #1
> -       whilelo p6.d, x4, x2
> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
> -       incw    z30.s
> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
> +       incd    x4
> +       add     z30.s, z30.s, z29.s
>         whilelo p7.d, x4, x3
>         b.any   .L86
>  .L84:
>         ret
>
> The patch was bootstrapped and tested on aarch64-linux-gnu, no
> regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>
> gcc/
>         * tree-vect-stmts.cc (vectorizable_store): Extend the use of
>         n_adjacent_stores to also cover vec_to_scalar operations.
>         * config/aarch64/aarch64-tuning-flags.def: Remove
>         use_new_vector_costs as tuning option.
>         * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>         Remove.
>         (aarch64_vector_costs::add_stmt_cost): Remove use of
>         aarch64_use_new_vector_costs_p.
>         (aarch64_vector_costs::finish_cost): Remove use of
>         aarch64_use_new_vector_costs_p.
>         * config/aarch64/tuning_models/cortexx925.h: Remove
>         AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>         * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>         * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>         * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>         * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>         * config/aarch64/tuning_models/neoversen2.h: Likewise.
>         * config/aarch64/tuning_models/neoversen3.h: Likewise.
>         * config/aarch64/tuning_models/neoversev1.h: Likewise.
>         * config/aarch64/tuning_models/neoversev2.h: Likewise.
>         * config/aarch64/tuning_models/neoversev3.h: Likewise.
>         * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>
> gcc/testsuite/
>         * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>         * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
> ---
>  gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
>  gcc/config/aarch64/aarch64.cc                 | 20 +++----------
>  gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>  .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>  .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>  .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>  .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>  gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>  gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>  gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>  gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>  gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>  .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>  .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>  .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>  gcc/tree-vect-stmts.cc                        | 29 ++++++++++---------
>  16 files changed, 22 insertions(+), 44 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> index ffbff20e29c..1de633c739b 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
> CHEAP_SHIFT_EXTEND)
>
>  AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>
> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
> -
>  AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
> MATCHED_VECTOR_THROUGHPUT)
>
>  AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 77a2a6bfa3a..71fba9cc63b 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, 
> bool costing_for_scalar)
>    return new aarch64_vector_costs (vinfo, costing_for_scalar);
>  }
>
> -/* Return true if the current CPU should use the new costs defined
> -   in GCC 11.  This should be removed for GCC 12 and above, with the
> -   costs applying to all CPUs instead.  */
> -static bool
> -aarch64_use_new_vector_costs_p ()
> -{
> -  return (aarch64_tune_params.extra_tuning_flags
> -         & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
> -}
> -
>  /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
>  static const simd_vec_cost *
>  aarch64_simd_vec_costs (tree vectype)
> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
>
>    /* Do one-time initialization based on the vinfo.  */
>    loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
> +  if (!m_analyzed_vinfo)
>      {
>        if (loop_vinfo)
>         analyze_loop_vinfo (loop_vinfo);
> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
>
>    /* Try to get a more accurate cost by looking at STMT_INFO instead
>       of just looking at KIND.  */
> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> +  if (stmt_info)
>      {
>        /* If we scalarize a strided store, the vectorizer costs one
>          vec_to_scalar for each element.  However, we can store the first
> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
>    else
>      m_num_last_promote_demote = 0;
>
> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> +  if (stmt_info)
>      {
>        /* Account for any extra "embedded" costs that apply additively
>          to the base cost calculated above.  */
> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const vector_costs 
> *uncast_scalar_costs)
>
>    auto *scalar_costs
>      = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
> -  if (loop_vinfo
> -      && m_vec_flags
> -      && aarch64_use_new_vector_costs_p ())
> +  if (loop_vinfo && m_vec_flags)
>      {
>        m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>                                              m_costs[vect_body]);
> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
> b/gcc/config/aarch64/tuning_models/cortexx925.h
> index b2ff716157a..0a8eff69307 100644
> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>    &generic_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> index 2d704ecd110..a564528f43d 100644
> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>    &generic_prefetch_tune,
>    AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> index bdd309ab03d..f090d5cde50 100644
> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> @@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>    &generic_prefetch_tune,
>    AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> index a05a9ab92a2..4c33c147444 100644
> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> @@ -249,7 +249,6 @@ static const struct tune_params generic_armv9_a_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>    &generic_armv9a_prefetch_tune,
>    AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> index c407b89a22f..fe4f7c10f73 100644
> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings =
>    0,   /* max_case_values.  */
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>    &generic_prefetch_tune,
>    AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
> b/gcc/config/aarch64/tuning_models/neoversen2.h
> index fd5f8f37370..0c74068da2c 100644
> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>    &generic_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
> b/gcc/config/aarch64/tuning_models/neoversen3.h
> index 8b156c2fe4d..9d4e1be171a 100644
> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
>    &generic_prefetch_tune,
>    AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
> b/gcc/config/aarch64/tuning_models/neoversev1.h
> index 23c121d8652..85a78bb2bef 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>    &generic_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
> b/gcc/config/aarch64/tuning_models/neoversev2.h
> index 40af5f47f4f..1dd452beb8d 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>     | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),  /* tune_flags.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
> b/gcc/config/aarch64/tuning_models/neoversev3.h
> index d65d74bfecf..d0ba5b1aef6 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>    &generic_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> index 7b7fa0b4b08..a1572048503 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
>    tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
>    (AARCH64_EXTRA_TUNE_BASE
>     | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>     | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>     | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
>    &generic_prefetch_tune,
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> index 762805ff54b..c334b7a6875 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> @@ -15,4 +15,4 @@
>     so we vectorize the offset calculation.  This means that the
>     64-bit version needs two copies.  */
>  /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> index f0ea58e38e2..94cc63049bc 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> @@ -15,4 +15,4 @@
>     so we vectorize the offset calculation.  This means that the
>     64-bit version needs two copies.  */
>  /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+, 
> z[0-9]+.s, uxtw 2\]\n} 3 } } */
> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, 
> z[0-9]+.d, lsl 3\]\n} 15 } } */
> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, 
> z[0-9]+.d, lsl 3\]\n} 9 } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index be1139a423c..6d7d28c4702 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo,
>                 {
>                   if (costing_p)
>                     {
> -                     /* Only need vector extracting when there are more
> -                        than one stores.  */
> -                     if (nstores > 1)
> -                       inside_cost
> -                         += record_stmt_cost (cost_vec, 1, vec_to_scalar,
> -                                              stmt_info, slp_node,
> -                                              0, vect_body);
>                       /* Take a single lane vector type store as scalar
>                          store to avoid ICE like 110776.  */
> -                     if (VECTOR_TYPE_P (ltype)
> -                         && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> +                     bool single_lane_vec_p =
> +                       VECTOR_TYPE_P (ltype)
> +                       && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U);
> +                     /* Only need vector extracting when there are more
> +                        than one stores.  */
> +                     if (nstores > 1 || single_lane_vec_p)
>                         n_adjacent_stores++;
> -                     else
> +                     if (!single_lane_vec_p)

I think it's somewhat non-obvious that nstores > 1 and single_lane_vec_p
correlate.  In fact I think that we always record a store; it's just that
for single-element vectors we record scalar stores.  I suggest always doing
just n_adjacent_stores++ here, and below ...

>                         inside_cost
>                           += record_stmt_cost (cost_vec, 1, scalar_store,
>                                                stmt_info, 0, vect_body);
> @@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo,
>        if (costing_p)
>         {
>           if (n_adjacent_stores > 0)
> -           vect_get_store_cost (vinfo, stmt_info, slp_node, 
> n_adjacent_stores,
> -                                alignment_support_scheme, misalignment,
> -                                &inside_cost, cost_vec);
> +           {
> +             vect_get_store_cost (vinfo, stmt_info, slp_node, 
> n_adjacent_stores,
> +                                  alignment_support_scheme, misalignment,
> +                                  &inside_cost, cost_vec);

... record n_adjacent_stores scalar_store when ltype is single-lane and
record n_adjacent_stores vec_to_scalar if nstores > 1 (and none otherwise).
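
Something along these lines, as a rough and completely untested sketch
(reusing the names and the record_stmt_cost/vect_get_store_cost call forms
from the hunks quoted above, so the details are only illustrative):

      for (i = 0; i < nstores; i++)
        {
          if (costing_p)
            {
              n_adjacent_stores++;
              continue;
            }
          [...]
        }
  [...]
      if (costing_p)
        {
          if (n_adjacent_stores > 0)
            {
              /* Take single lane vector type stores as scalar stores
                 (to avoid ICEs like PR 110776); cost everything else
                 as a group of vector stores.  */
              if (VECTOR_TYPE_P (ltype)
                  && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
                vect_get_store_cost (vinfo, stmt_info, slp_node,
                                     n_adjacent_stores,
                                     alignment_support_scheme, misalignment,
                                     &inside_cost, cost_vec);
              else
                inside_cost
                  += record_stmt_cost (cost_vec, n_adjacent_stores,
                                       scalar_store, stmt_info, 0, vect_body);
              /* Element extraction is only needed when a vector is split
                 into more than one store.  */
              if (nstores > 1)
                inside_cost
                  += record_stmt_cost (cost_vec, n_adjacent_stores,
                                       vec_to_scalar, stmt_info, slp_node,
                                       0, vect_body);
            }
          [...]
        }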

Richard.

> +             inside_cost
> +               += record_stmt_cost (cost_vec, n_adjacent_stores, 
> vec_to_scalar,
> +                                    stmt_info, slp_node,
> +                                    0, vect_body);
> +           }
>           if (dump_enabled_p ())
>             dump_printf_loc (MSG_NOTE, vect_location,
>                              "vect_model_store_cost: inside_cost = %d, "
> --
> 2.44.0
>
>
> >>
> >> Richard
> >>
> >>> Thanks,
> >>> Jennifer
> >>>>
> >>>>> Thanks,
> >>>>> Jennifer
> >>>>>
> >>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable 
> >>>>> and
> >>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
> >>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
> >>>>> default. To that end, the function aarch64_use_new_vector_costs_p and 
> >>>>> its uses
> >>>>> were removed. To prevent costing vec_to_scalar operations with 0, as
> >>>>> described in
> >>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
> >>>>> we guarded the call to vect_is_store_elt_extraction in
> >>>>> aarch64_vector_costs::add_stmt_cost by count > 1.
> >>>>>
> >>>>> Two tests were adjusted due to changes in codegen. In both cases, the
> >>>>> old code performed loop unrolling once, but the new code does not:
> >>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
> >>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
> >>>>> -moverride=tune=none):
> >>>>> f_int64_t_32:
> >>>>>      cbz     w3, .L92
> >>>>>      mov     x4, 0
> >>>>>      uxtw    x3, w3
> >>>>> +       cntd    x5
> >>>>> +       whilelo p7.d, xzr, x3
> >>>>> +       mov     z29.s, w5
> >>>>>      mov     z31.s, w2
> >>>>> -       whilelo p6.d, xzr, x3
> >>>>> -       mov     x2, x3
> >>>>> -       index   z30.s, #0, #1
> >>>>> -       uqdecd  x2
> >>>>> -       ptrue   p5.b, all
> >>>>> -       whilelo p7.d, xzr, x2
> >>>>> +       index   z30.d, #0, #1
> >>>>> +       ptrue   p6.b, all
> >>>>>      .p2align 3,,7
> >>>>> .L94:
> >>>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
> >>>>> -       ld1d    z28.d, p6/z, [x0]
> >>>>> -       movprfx z29, z31
> >>>>> -       mul     z29.s, p5/m, z29.s, z30.s
> >>>>> -       incw    x4
> >>>>> -       uunpklo z0.d, z29.s
> >>>>> -       uunpkhi z29.d, z29.s
> >>>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
> >>>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
> >>>>> -       add     z25.d, z28.d, z25.d
> >>>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
> >>>>> +       movprfx z28, z31
> >>>>> +       mul     z28.s, p6/m, z28.s, z30.s
> >>>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
> >>>>>      add     z26.d, z27.d, z26.d
> >>>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
> >>>>> -       whilelo p7.d, x4, x2
> >>>>> -       st1d    z25.d, p6, [x0]
> >>>>> -       incw    z30.s
> >>>>> -       incb    x0, all, mul #2
> >>>>> -       whilelo p6.d, x4, x3
> >>>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
> >>>>> +       add     z30.s, z30.s, z29.s
> >>>>> +       incd    x4
> >>>>> +       whilelo p7.d, x4, x3
> >>>>>      b.any   .L94
> >>>>> .L92:
> >>>>>      ret
> >>>>>
> >>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
> >>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
> >>>>> -moverride=tune=none):
> >>>>> f_int64_t_32:
> >>>>>      cbz     w3, .L84
> >>>>> -       addvl   x5, x1, #1
> >>>>>      mov     x4, 0
> >>>>>      uxtw    x3, w3
> >>>>> -       mov     z31.s, w2
> >>>>> +       cntd    x5
> >>>>>      whilelo p7.d, xzr, x3
> >>>>> -       mov     x2, x3
> >>>>> -       index   z30.s, #0, #1
> >>>>> -       uqdecd  x2
> >>>>> -       ptrue   p5.b, all
> >>>>> -       whilelo p6.d, xzr, x2
> >>>>> +       mov     z29.s, w5
> >>>>> +       mov     z31.s, w2
> >>>>> +       index   z30.d, #0, #1
> >>>>> +       ptrue   p6.b, all
> >>>>>      .p2align 3,,7
> >>>>> .L86:
> >>>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
> >>>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
> >>>>> -       movprfx z29, z30
> >>>>> -       mul     z29.s, p5/m, z29.s, z31.s
> >>>>> -       add     z28.d, z28.d, #1
> >>>>> -       uunpklo z26.d, z29.s
> >>>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
> >>>>> -       incw    x4
> >>>>> -       uunpkhi z29.d, z29.s
> >>>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
> >>>>> +       movprfx z28, z30
> >>>>> +       mul     z28.s, p6/m, z28.s, z31.s
> >>>>>      add     z27.d, z27.d, #1
> >>>>> -       whilelo p6.d, x4, x2
> >>>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
> >>>>> -       incw    z30.s
> >>>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
> >>>>> +       incd    x4
> >>>>> +       add     z30.s, z30.s, z29.s
> >>>>>      whilelo p7.d, x4, x3
> >>>>>      b.any   .L86
> >>>>> .L84:
> >>>>>    ret
> >>>>>
> >>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
> >>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace machine 
> >>>>> and saw
> >>>>> no non-noise impact on performance. We would appreciate help with wider
> >>>>> benchmarking on other platforms, if necessary.
> >>>>> OK for mainline?
> >>>>>
> >>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
> >>>>>
> >>>>> gcc/
> >>>>>    * config/aarch64/aarch64-tuning-flags.def: Remove
> >>>>>    use_new_vector_costs as tuning option.
> >>>>>    * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
> >>>>>    Remove.
> >>>>>    (aarch64_vector_costs::add_stmt_cost): Remove use of
> >>>>>    aarch64_use_new_vector_costs_p and guard call to
> >>>>>    vect_is_store_elt_extraction with count > 1.
> >>>>>    (aarch64_vector_costs::finish_cost): Remove use of
> >>>>>    aarch64_use_new_vector_costs_p.
> >>>>>    * config/aarch64/tuning_models/cortexx925.h: Remove
> >>>>>    AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
> >>>>>    * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
> >>>>>    * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
> >>>>>    * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
> >>>>>    * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
> >>>>>    * config/aarch64/tuning_models/neoversen2.h: Likewise.
> >>>>>    * config/aarch64/tuning_models/neoversen3.h: Likewise.
> >>>>>    * config/aarch64/tuning_models/neoversev1.h: Likewise.
> >>>>>    * config/aarch64/tuning_models/neoversev2.h: Likewise.
> >>>>>    * config/aarch64/tuning_models/neoversev3.h: Likewise.
> >>>>>    * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
> >>>>>
> >>>>> gcc/testsuite/
> >>>>>    * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
> >>>>>    * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
> >>>>> ---
> >>>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
> >>>>> gcc/config/aarch64/aarch64.cc                 | 22 +++++--------------
> >>>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
> >>>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
> >>>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
> >>>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
> >>>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
> >>>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
> >>>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
> >>>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
> >>>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
> >>>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
> >>>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
> >>>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
> >>>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
> >>>>> 15 files changed, 7 insertions(+), 32 deletions(-)
> >>>>>
> >>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
> >>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>> index 5939602576b..ed345b13ed3 100644
> >>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
> >>>>> CHEAP_SHIFT_EXTEND)
> >>>>>
> >>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", 
> >>>>> CSE_SVE_VL_CONSTANTS)
> >>>>>
> >>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", 
> >>>>> USE_NEW_VECTOR_COSTS)
> >>>>> -
> >>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
> >>>>> MATCHED_VECTOR_THROUGHPUT)
> >>>>>
> >>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
> >>>>> AVOID_CROSS_LOOP_FMA)
> >>>>> diff --git a/gcc/config/aarch64/aarch64.cc 
> >>>>> b/gcc/config/aarch64/aarch64.cc
> >>>>> index 43238aefef2..03806671c97 100644
> >>>>> --- a/gcc/config/aarch64/aarch64.cc
> >>>>> +++ b/gcc/config/aarch64/aarch64.cc
> >>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info 
> >>>>> *vinfo, bool costing_for_scalar)
> >>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
> >>>>> }
> >>>>>
> >>>>> -/* Return true if the current CPU should use the new costs defined
> >>>>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
> >>>>> -   costs applying to all CPUs instead.  */
> >>>>> -static bool
> >>>>> -aarch64_use_new_vector_costs_p ()
> >>>>> -{
> >>>>> -  return (aarch64_tune_params.extra_tuning_flags
> >>>>> -       & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
> >>>>> -}
> >>>>> -
> >>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
> >>>>> static const simd_vec_cost *
> >>>>> aarch64_simd_vec_costs (tree vectype)
> >>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> >>>>> vect_cost_for_stmt kind,
> >>>>>
> >>>>> /* Do one-time initialization based on the vinfo.  */
> >>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> >>>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
> >>>>> +  if (!m_analyzed_vinfo)
> >>>>>   {
> >>>>>     if (loop_vinfo)
> >>>>>    analyze_loop_vinfo (loop_vinfo);
> >>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost (int 
> >>>>> count, vect_cost_for_stmt kind,
> >>>>>
> >>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
> >>>>>    of just looking at KIND.  */
> >>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>> +  if (stmt_info)
> >>>>>   {
> >>>>>     /* If we scalarize a strided store, the vectorizer costs one
> >>>>>     vec_to_scalar for each element.  However, we can store the first
> >>>>>     element using an FP store without a separate extract step.  */
> >>>>> -      if (vect_is_store_elt_extraction (kind, stmt_info))
> >>>>> +      if (vect_is_store_elt_extraction (kind, stmt_info) && count > 1)
> >>>>>    count -= 1;
> >>>>>
> >>>>>     stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
> >>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> >>>>> vect_cost_for_stmt kind,
> >>>>> else
> >>>>>   m_num_last_promote_demote = 0;
> >>>>>
> >>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>> +  if (stmt_info)
> >>>>>   {
> >>>>>     /* Account for any extra "embedded" costs that apply additively
> >>>>>     to the base cost calculated above.  */
> >>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const 
> >>>>> vector_costs *uncast_scalar_costs)
> >>>>>
> >>>>> auto *scalar_costs
> >>>>>   = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
> >>>>> -  if (loop_vinfo
> >>>>> -      && m_vec_flags
> >>>>> -      && aarch64_use_new_vector_costs_p ())
> >>>>> +  if (loop_vinfo && m_vec_flags)
> >>>>>   {
> >>>>>     m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> >>>>>                                         m_costs[vect_body]);
> >>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
> >>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>> index eb9b89984b0..dafea96e924 100644
> >>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings =
> >>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
> >>>>> &generic_prefetch_tune,
> >>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
> >>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>> index 6a098497759..ac001927959 100644
> >>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>> @@ -55,7 +55,6 @@ static const struct tune_params 
> >>>>> fujitsu_monaka_tunings =
> >>>>> 0, /* max_case_values.  */
> >>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
> >>>>> &generic_prefetch_tune,
> >>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
> >>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>> index 9b1cbfc5bd2..7b534831340 100644
> >>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>> @@ -183,7 +183,6 @@ static const struct tune_params 
> >>>>> generic_armv8_a_tunings =
> >>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
> >>>>> &generic_prefetch_tune,
> >>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
> >>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>> index 48353a59939..562ef89c67b 100644
> >>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>> @@ -249,7 +249,6 @@ static const struct tune_params 
> >>>>> generic_armv9_a_tunings =
> >>>>> 0, /* max_case_values.  */
> >>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
> >>>>> &generic_armv9a_prefetch_tune,
> >>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
> >>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>> index c407b89a22f..fe4f7c10f73 100644
> >>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>> @@ -156,7 +156,6 @@ static const struct tune_params 
> >>>>> neoverse512tvb_tunings =
> >>>>> 0, /* max_case_values.  */
> >>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
> >>>>> &generic_prefetch_tune,
> >>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
> >>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>> index 18199ac206c..56be77423cb 100644
> >>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
> >>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
> >>>>> &generic_prefetch_tune,
> >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
> >>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>> index 4da85cfac0d..254ad5e27f8 100644
> >>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
> >>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
> >>>>> &generic_prefetch_tune,
> >>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
> >>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>> index dd9120eee48..c7241cf23d7 100644
> >>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>> @@ -227,7 +227,6 @@ static const struct tune_params neoversev1_tunings =
> >>>>> 0, /* max_case_values.  */
> >>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>  | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
> >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
> >>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>> index 1369de73991..96f55940649 100644
> >>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings =
> >>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
> >>>>>  | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),        /* tune_flags.  */
> >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
> >>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>> index d8c82255378..f62ae67d355 100644
> >>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
> >>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
> >>>>> &generic_prefetch_tune,
> >>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
> >>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>> index 7f050501ede..0233baf5e34 100644
> >>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>> @@ -219,7 +219,6 @@ static const struct tune_params 
> >>>>> neoversev3ae_tunings =
> >>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> >>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
> >>>>> &generic_prefetch_tune,
> >>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
> >>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>> index 762805ff54b..c334b7a6875 100644
> >>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>> @@ -15,4 +15,4 @@
> >>>>>  so we vectorize the offset calculation.  This means that the
> >>>>>  64-bit version needs two copies.  */
> >>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
> >>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
> >>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
> >>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
> >>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>> index f0ea58e38e2..94cc63049bc 100644
> >>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>> @@ -15,4 +15,4 @@
> >>>>>  so we vectorize the offset calculation.  This means that the
> >>>>>  64-bit version needs two copies.  */
> >>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
> >>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
> >>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
> >>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>>>
> >>>>
> >>>> --
> >>>> Richard Biener <rguent...@suse.de>
> >>>> SUSE Software Solutions Germany GmbH,
> >>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
> >>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG 
> >>>> Nuernberg)
> >
> >
>
