Re: [RFC][PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS

Jennifer Schmitz Thu, 12 Dec 2024 08:26:55 -0800


> On 6 Dec 2024, at 08:41, Jennifer Schmitz <jschm...@nvidia.com> wrote:
> 
> 
> 
>> On 5 Dec 2024, at 20:07, Richard Sandiford <richard.sandif...@arm.com> wrote:
>> 
>> External email: Use caution opening links or attachments
>> 
>> 
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote:
>>>> 
>>>> External email: Use caution opening links or attachments
>>>> 
>>>> 
>>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote:
>>>> 
>>>>> 
>>>>> 
>>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford <richard.sandif...@arm.com> 
>>>>>> wrote:
>>>>>> 
>>>>>> External email: Use caution opening links or attachments
>>>>>> 
>>>>>> 
>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>>> [...]
>>>>>>> Looking at the diff of the vect dumps (below is a section of the diff 
>>>>>>> for strided_store_2.c), it seemed odd that vec_to_scalar operations 
>>>>>>> cost 0 now, instead of the previous cost of 2:
>>>>>>> 
>>>>>>> +strided_store_1.c:38:151: note:    === vectorizable_operation ===
>>>>>>> +strided_store_1.c:38:151: note:    vect_model_simple_cost: inside_cost 
>>>>>>> = 1, prologue_cost  = 0 .
>>>>>>> +strided_store_1.c:38:151: note:   ==> examining statement: *_6 = _7;
>>>>>>> +strided_store_1.c:38:151: note:   vect_is_simple_use: operand _3 + 
>>>>>>> 1.0e+0, type of def:    internal
>>>>>>> +strided_store_1.c:38:151: note:   Vectorizing an unaligned access.
>>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128
>>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
>>>>>>> +strided_store_1.c:38:151: note:   vect_model_store_cost: inside_cost = 
>>>>>>> 12, prologue_cost = 0 .
>>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body
>>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
>>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>> +<unknown> 1 times vector_load costs 1 in prologue
>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>> 
>>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in multiple 
>>>>>>> places in aarch64.cc, the location that causes this behavior is this 
>>>>>>> one:
>>>>>>> unsigned
>>>>>>> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>>>>>>                                  stmt_vec_info stmt_info, slp_tree,
>>>>>>>                                  tree vectype, int misalign,
>>>>>>>                                  vect_cost_model_location where)
>>>>>>> {
>>>>>>> [...]
>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>>  of just looking at KIND.  */
>>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>> +  if (stmt_info)
>>>>>>> {
>>>>>>>   /* If we scalarize a strided store, the vectorizer costs one
>>>>>>>      vec_to_scalar for each element.  However, we can store the first
>>>>>>>      element using an FP store without a separate extract step.  */
>>>>>>>   if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>>     count -= 1;
>>>>>>> 
>>>>>>>   stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>>>>                                                   stmt_info, stmt_cost);
>>>>>>> 
>>>>>>>   if (vectype && m_vec_flags)
>>>>>>>     stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>>>>>>>                                                     stmt_info, vectype,
>>>>>>>                                                     where, stmt_cost);
>>>>>>> }
>>>>>>> [...]
>>>>>>> return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
>>>>>>> }
>>>>>>> 
>>>>>>> Previously, for mtune=generic, this function returned a cost of 2 for a 
>>>>>>> vec_to_scalar operation in the vect body. Now "if (stmt_info)" is 
>>>>>>> entered and "if (vect_is_store_elt_extraction (kind, stmt_info))" 
>>>>>>> evaluates to true, which sets the count to 0 and leads to a return 
>>>>>>> value of 0.
>>>>>> 
>>>>>> At the time the code was written, a scalarised store would be costed
>>>>>> using one vec_to_scalar call into the backend, with the count parameter
>>>>>> set to the number of elements being stored.  The "count -= 1" was
>>>>>> supposed to lop off the leading element extraction, since we can store
>>>>>> lane 0 as a normal FP store.
>>>>>> 
>>>>>> The target-independent costing was later reworked so that it costs
>>>>>> each operation individually:
>>>>>> 
>>>>>>           for (i = 0; i < nstores; i++)
>>>>>>             {
>>>>>>               if (costing_p)
>>>>>>                 {
>>>>>>                   /* Only need vector extracting when there are more
>>>>>>                      than one stores.  */
>>>>>>                   if (nstores > 1)
>>>>>>                     inside_cost
>>>>>>                       += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>>>>>>                                            stmt_info, 0, vect_body);
>>>>>>                   /* Take a single lane vector type store as scalar
>>>>>>                      store to avoid ICE like 110776.  */
>>>>>>                   if (VECTOR_TYPE_P (ltype)
>>>>>>                       && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>>>>                     n_adjacent_stores++;
>>>>>>                   else
>>>>>>                     inside_cost
>>>>>>                       += record_stmt_cost (cost_vec, 1, scalar_store,
>>>>>>                                            stmt_info, 0, vect_body);
>>>>>>                   continue;
>>>>>>                 }
>>>>>> 
>>>>>> Unfortunately, there's no easy way of telling whether a particular call
>>>>>> is part of a group, and if so, which member of the group it is.
>>>>>> 
>>>>>> I suppose we could give up on the attempt to be (somewhat) accurate
>>>>>> and just disable the optimisation.  Or we could restrict it to count > 1,
>>>>>> since it might still be useful for gathers and scatters.
>>>>> I tried restricting the calls to vect_is_store_elt_extraction to count > 
>>>>> 1 and it seems to resolve the issue of costing vec_to_scalar operations 
>>>>> with 0 (see patch below).
>>>>> What are your thoughts on this?
>>>> 
>>>> Why didn't you pursue instead moving the vec_to_scalar cost together
>>>> with the n_adjacent_store handling?
>>> When I continued working on this patch, we had already reached stage 3 and 
>>> I was hesitant to introduce changes to the middle-end that were not 
>>> previously covered by this patch. So I tried if the issue could not be 
>>> resolved by making a small change in the backend.
>>> If you still advise to use the n_adjacent_store instead, I’m happy to look 
>>> into it again.
>> 
>> If Richard's ok with adjusting vectorizable_store for GCC 15 (which it
>> sounds like he is), then I agree that would be better.  Otherwise we'd
>> be creating technical debt to clean up for GCC 16.  And it is a regression
>> of sorts, so is stage 3 material from that POV.
>> 
>> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a
>> "let's clean this up next stage 1" thing, since we needed to add tuning
>> for a new CPU late during the cycle.  But of course, there were other
>> priorities when stage 1 actually came around, so it never actually
>> happened.  Thanks again for being the one to sort this out.)
> Thanks for your feedback. Then I will try to make it work in 
> vectorizable_store.
> Best,
> Jennifer
Below is the updated patch with a suggestion for the changes in 
vectorizable_store. It resolves the issue with the vec_to_scalar operations 
that were individually costed with 0.
We already tested it on aarch64, no regression, but we are still doing 
performance testing.
Can you give some feedback in the meantime on the patch itself?
Thanks,
Jennifer



This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
default. To that end, the function aarch64_use_new_vector_costs_p and its uses
were removed. To prevent costing vec_to_scalar operations with 0, as
described in
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
we adjusted vectorizable_store such that the variable n_adjacent_stores
also covers vec_to_scalar operations. This way vec_to_scalar operations
are not costed individually, but as a group.

Two tests were adjusted due to changes in codegen. In both cases, the
old code performed loop unrolling once, but the new code does not:
Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
-O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none):
f_int64_t_32:
        cbz     w3, .L92
        mov     x4, 0
        uxtw    x3, w3
+       cntd    x5
+       whilelo p7.d, xzr, x3
+       mov     z29.s, w5
        mov     z31.s, w2
-       whilelo p6.d, xzr, x3
-       mov     x2, x3
-       index   z30.s, #0, #1
-       uqdecd  x2
-       ptrue   p5.b, all
-       whilelo p7.d, xzr, x2
+       index   z30.d, #0, #1
+       ptrue   p6.b, all
        .p2align 3,,7
 .L94:
-       ld1d    z27.d, p7/z, [x0, #1, mul vl]
-       ld1d    z28.d, p6/z, [x0]
-       movprfx z29, z31
-       mul     z29.s, p5/m, z29.s, z30.s
-       incw    x4
-       uunpklo z0.d, z29.s
-       uunpkhi z29.d, z29.s
-       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
-       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
-       add     z25.d, z28.d, z25.d
+       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
+       movprfx z28, z31
+       mul     z28.s, p6/m, z28.s, z30.s
+       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
        add     z26.d, z27.d, z26.d
-       st1d    z26.d, p7, [x0, #1, mul vl]
-       whilelo p7.d, x4, x2
-       st1d    z25.d, p6, [x0]
-       incw    z30.s
-       incb    x0, all, mul #2
-       whilelo p6.d, x4, x3
+       st1d    z26.d, p7, [x0, x4, lsl 3]
+       add     z30.s, z30.s, z29.s
+       incd    x4
+       whilelo p7.d, x4, x3
        b.any   .L94
 .L92:
        ret

Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
-O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none):
f_int64_t_32:
        cbz     w3, .L84
-       addvl   x5, x1, #1
        mov     x4, 0
        uxtw    x3, w3
-       mov     z31.s, w2
+       cntd    x5
        whilelo p7.d, xzr, x3
-       mov     x2, x3
-       index   z30.s, #0, #1
-       uqdecd  x2
-       ptrue   p5.b, all
-       whilelo p6.d, xzr, x2
+       mov     z29.s, w5
+       mov     z31.s, w2
+       index   z30.d, #0, #1
+       ptrue   p6.b, all
        .p2align 3,,7
 .L86:
-       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
-       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
-       movprfx z29, z30
-       mul     z29.s, p5/m, z29.s, z31.s
-       add     z28.d, z28.d, #1
-       uunpklo z26.d, z29.s
-       st1d    z28.d, p7, [x0, z26.d, lsl 3]
-       incw    x4
-       uunpkhi z29.d, z29.s
+       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
+       movprfx z28, z30
+       mul     z28.s, p6/m, z28.s, z31.s
        add     z27.d, z27.d, #1
-       whilelo p6.d, x4, x2
-       st1d    z27.d, p7, [x0, z29.d, lsl 3]
-       incw    z30.s
+       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
+       incd    x4
+       add     z30.s, z30.s, z29.s
        whilelo p7.d, x4, x3
        b.any   .L86
 .L84:
        ret

The patch was bootstrapped and tested on aarch64-linux-gnu, no
regression.
OK for mainline?

Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>

gcc/
        * tree-vect-stmts.cc (vectorizable_store): Extend the use of
        n_adjacent_stores to also cover vec_to_scalar operations.
        * config/aarch64/aarch64-tuning-flags.def: Remove
        use_new_vector_costs as tuning option.
        * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
        Remove.
        (aarch64_vector_costs::add_stmt_cost): Remove use of
        aarch64_use_new_vector_costs_p.
        (aarch64_vector_costs::finish_cost): Remove use of
        aarch64_use_new_vector_costs_p.
        * config/aarch64/tuning_models/cortexx925.h: Remove
        AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
        * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
        * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
        * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
        * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
        * config/aarch64/tuning_models/neoversen2.h: Likewise.
        * config/aarch64/tuning_models/neoversen3.h: Likewise.
        * config/aarch64/tuning_models/neoversev1.h: Likewise.
        * config/aarch64/tuning_models/neoversev2.h: Likewise.
        * config/aarch64/tuning_models/neoversev3.h: Likewise.
        * config/aarch64/tuning_models/neoversev3ae.h: Likewise.

gcc/testsuite/
        * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
        * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
---
 gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
 gcc/config/aarch64/aarch64.cc                 | 20 +++----------
 gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
 .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
 .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
 .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
 .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
 gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
 gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
 gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
 gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
 gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
 .../aarch64/tuning_models/neoversev3ae.h      |  1 -
 .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
 .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
 gcc/tree-vect-stmts.cc                        | 29 ++++++++++---------
 16 files changed, 22 insertions(+), 44 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
b/gcc/config/aarch64/aarch64-tuning-flags.def
index ffbff20e29c..1de633c739b 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
CHEAP_SHIFT_EXTEND)
 
 AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
 
-AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
-
 AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
MATCHED_VECTOR_THROUGHPUT)
 
 AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 77a2a6bfa3a..71fba9cc63b 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, bool 
costing_for_scalar)
   return new aarch64_vector_costs (vinfo, costing_for_scalar);
 }
 
-/* Return true if the current CPU should use the new costs defined
-   in GCC 11.  This should be removed for GCC 12 and above, with the
-   costs applying to all CPUs instead.  */
-static bool
-aarch64_use_new_vector_costs_p ()
-{
-  return (aarch64_tune_params.extra_tuning_flags
-         & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
-}
-
 /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
 static const simd_vec_cost *
 aarch64_simd_vec_costs (tree vectype)
@@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
vect_cost_for_stmt kind,
 
   /* Do one-time initialization based on the vinfo.  */
   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
-  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
+  if (!m_analyzed_vinfo)
     {
       if (loop_vinfo)
        analyze_loop_vinfo (loop_vinfo);
@@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
vect_cost_for_stmt kind,
 
   /* Try to get a more accurate cost by looking at STMT_INFO instead
      of just looking at KIND.  */
-  if (stmt_info && aarch64_use_new_vector_costs_p ())
+  if (stmt_info)
     {
       /* If we scalarize a strided store, the vectorizer costs one
         vec_to_scalar for each element.  However, we can store the first
@@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
vect_cost_for_stmt kind,
   else
     m_num_last_promote_demote = 0;
 
-  if (stmt_info && aarch64_use_new_vector_costs_p ())
+  if (stmt_info)
     {
       /* Account for any extra "embedded" costs that apply additively
         to the base cost calculated above.  */
@@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const vector_costs 
*uncast_scalar_costs)
 
   auto *scalar_costs
     = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
-  if (loop_vinfo
-      && m_vec_flags
-      && aarch64_use_new_vector_costs_p ())
+  if (loop_vinfo && m_vec_flags)
     {
       m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
                                             m_costs[vect_body]);
diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
b/gcc/config/aarch64/tuning_models/cortexx925.h
index b2ff716157a..0a8eff69307 100644
--- a/gcc/config/aarch64/tuning_models/cortexx925.h
+++ b/gcc/config/aarch64/tuning_models/cortexx925.h
@@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
   &generic_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
index 2d704ecd110..a564528f43d 100644
--- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
+++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
@@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
   0,   /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
   &generic_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
index bdd309ab03d..f090d5cde50 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
@@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
   &generic_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
index a05a9ab92a2..4c33c147444 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
@@ -249,7 +249,6 @@ static const struct tune_params generic_armv9_a_tunings =
   0,   /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
   &generic_armv9a_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
index c407b89a22f..fe4f7c10f73 100644
--- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
+++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
@@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings =
   0,   /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
   &generic_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
b/gcc/config/aarch64/tuning_models/neoversen2.h
index fd5f8f37370..0c74068da2c 100644
--- a/gcc/config/aarch64/tuning_models/neoversen2.h
+++ b/gcc/config/aarch64/tuning_models/neoversen2.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
   &generic_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
b/gcc/config/aarch64/tuning_models/neoversen3.h
index 8b156c2fe4d..9d4e1be171a 100644
--- a/gcc/config/aarch64/tuning_models/neoversen3.h
+++ b/gcc/config/aarch64/tuning_models/neoversen3.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),    /* tune_flags.  */
   &generic_prefetch_tune,
   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
b/gcc/config/aarch64/tuning_models/neoversev1.h
index 23c121d8652..85a78bb2bef 100644
--- a/gcc/config/aarch64/tuning_models/neoversev1.h
+++ b/gcc/config/aarch64/tuning_models/neoversev1.h
@@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
   &generic_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
b/gcc/config/aarch64/tuning_models/neoversev2.h
index 40af5f47f4f..1dd452beb8d 100644
--- a/gcc/config/aarch64/tuning_models/neoversev2.h
+++ b/gcc/config/aarch64/tuning_models/neoversev2.h
@@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
    | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),  /* tune_flags.  */
diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
b/gcc/config/aarch64/tuning_models/neoversev3.h
index d65d74bfecf..d0ba5b1aef6 100644
--- a/gcc/config/aarch64/tuning_models/neoversev3.h
+++ b/gcc/config/aarch64/tuning_models/neoversev3.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
   &generic_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
b/gcc/config/aarch64/tuning_models/neoversev3ae.h
index 7b7fa0b4b08..a1572048503 100644
--- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
+++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
   tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_BASE
    | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
    | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
   &generic_prefetch_tune,
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
index 762805ff54b..c334b7a6875 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
@@ -15,4 +15,4 @@
    so we vectorize the offset calculation.  This means that the
    64-bit version needs two copies.  */
 /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, \[x[0-9]+, 
z[0-9]+.s, uxtw 2\]\n} 3 } } */
-/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, 
z[0-9]+.d, lsl 3\]\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, 
z[0-9]+.d, lsl 3\]\n} 9 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
index f0ea58e38e2..94cc63049bc 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
@@ -15,4 +15,4 @@
    so we vectorize the offset calculation.  This means that the
    64-bit version needs two copies.  */
 /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+, 
z[0-9]+.s, uxtw 2\]\n} 3 } } */
-/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, 
z[0-9]+.d, lsl 3\]\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, 
z[0-9]+.d, lsl 3\]\n} 9 } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index be1139a423c..6d7d28c4702 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo,
                {
                  if (costing_p)
                    {
-                     /* Only need vector extracting when there are more
-                        than one stores.  */
-                     if (nstores > 1)
-                       inside_cost
-                         += record_stmt_cost (cost_vec, 1, vec_to_scalar,
-                                              stmt_info, slp_node,
-                                              0, vect_body);
                      /* Take a single lane vector type store as scalar
                         store to avoid ICE like 110776.  */
-                     if (VECTOR_TYPE_P (ltype)
-                         && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
+                     bool single_lane_vec_p =
+                       VECTOR_TYPE_P (ltype)
+                       && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U);
+                     /* Only need vector extracting when there are more
+                        than one stores.  */
+                     if (nstores > 1 || single_lane_vec_p)
                        n_adjacent_stores++;
-                     else
+                     if (!single_lane_vec_p)
                        inside_cost
                          += record_stmt_cost (cost_vec, 1, scalar_store,
                                               stmt_info, 0, vect_body);
@@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo,
       if (costing_p)
        {
          if (n_adjacent_stores > 0)
-           vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
-                                alignment_support_scheme, misalignment,
-                                &inside_cost, cost_vec);
+           {
+             vect_get_store_cost (vinfo, stmt_info, slp_node, 
n_adjacent_stores,
+                                  alignment_support_scheme, misalignment,
+                                  &inside_cost, cost_vec);
+             inside_cost
+               += record_stmt_cost (cost_vec, n_adjacent_stores, vec_to_scalar,
+                                    stmt_info, slp_node,
+                                    0, vect_body);
+           }
          if (dump_enabled_p ())
            dump_printf_loc (MSG_NOTE, vect_location,
                             "vect_model_store_cost: inside_cost = %d, "
-- 
2.44.0


>> 
>> Richard
>> 
>>> Thanks,
>>> Jennifer
>>>> 
>>>>> Thanks,
>>>>> Jennifer
>>>>> 
>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>>> default. To that end, the function aarch64_use_new_vector_costs_p and its 
>>>>> uses
>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as
>>>>> described in
>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>>>> we guarded the call to vect_is_store_elt_extraction in
>>>>> aarch64_vector_costs::add_stmt_cost by count > 1.
>>>>> 
>>>>> Two tests were adjusted due to changes in codegen. In both cases, the
>>>>> old code performed loop unrolling once, but the new code does not:
>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>>> -moverride=tune=none):
>>>>> f_int64_t_32:
>>>>>      cbz     w3, .L92
>>>>>      mov     x4, 0
>>>>>      uxtw    x3, w3
>>>>> +       cntd    x5
>>>>> +       whilelo p7.d, xzr, x3
>>>>> +       mov     z29.s, w5
>>>>>      mov     z31.s, w2
>>>>> -       whilelo p6.d, xzr, x3
>>>>> -       mov     x2, x3
>>>>> -       index   z30.s, #0, #1
>>>>> -       uqdecd  x2
>>>>> -       ptrue   p5.b, all
>>>>> -       whilelo p7.d, xzr, x2
>>>>> +       index   z30.d, #0, #1
>>>>> +       ptrue   p6.b, all
>>>>>      .p2align 3,,7
>>>>> .L94:
>>>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
>>>>> -       ld1d    z28.d, p6/z, [x0]
>>>>> -       movprfx z29, z31
>>>>> -       mul     z29.s, p5/m, z29.s, z30.s
>>>>> -       incw    x4
>>>>> -       uunpklo z0.d, z29.s
>>>>> -       uunpkhi z29.d, z29.s
>>>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
>>>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
>>>>> -       add     z25.d, z28.d, z25.d
>>>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
>>>>> +       movprfx z28, z31
>>>>> +       mul     z28.s, p6/m, z28.s, z30.s
>>>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>>>>>      add     z26.d, z27.d, z26.d
>>>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
>>>>> -       whilelo p7.d, x4, x2
>>>>> -       st1d    z25.d, p6, [x0]
>>>>> -       incw    z30.s
>>>>> -       incb    x0, all, mul #2
>>>>> -       whilelo p6.d, x4, x3
>>>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
>>>>> +       add     z30.s, z30.s, z29.s
>>>>> +       incd    x4
>>>>> +       whilelo p7.d, x4, x3
>>>>>      b.any   .L94
>>>>> .L92:
>>>>>      ret
>>>>> 
>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>>> -moverride=tune=none):
>>>>> f_int64_t_32:
>>>>>      cbz     w3, .L84
>>>>> -       addvl   x5, x1, #1
>>>>>      mov     x4, 0
>>>>>      uxtw    x3, w3
>>>>> -       mov     z31.s, w2
>>>>> +       cntd    x5
>>>>>      whilelo p7.d, xzr, x3
>>>>> -       mov     x2, x3
>>>>> -       index   z30.s, #0, #1
>>>>> -       uqdecd  x2
>>>>> -       ptrue   p5.b, all
>>>>> -       whilelo p6.d, xzr, x2
>>>>> +       mov     z29.s, w5
>>>>> +       mov     z31.s, w2
>>>>> +       index   z30.d, #0, #1
>>>>> +       ptrue   p6.b, all
>>>>>      .p2align 3,,7
>>>>> .L86:
>>>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
>>>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
>>>>> -       movprfx z29, z30
>>>>> -       mul     z29.s, p5/m, z29.s, z31.s
>>>>> -       add     z28.d, z28.d, #1
>>>>> -       uunpklo z26.d, z29.s
>>>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
>>>>> -       incw    x4
>>>>> -       uunpkhi z29.d, z29.s
>>>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
>>>>> +       movprfx z28, z30
>>>>> +       mul     z28.s, p6/m, z28.s, z31.s
>>>>>      add     z27.d, z27.d, #1
>>>>> -       whilelo p6.d, x4, x2
>>>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
>>>>> -       incw    z30.s
>>>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
>>>>> +       incd    x4
>>>>> +       add     z30.s, z30.s, z29.s
>>>>>      whilelo p7.d, x4, x3
>>>>>      b.any   .L86
>>>>> .L84:
>>>>>    ret
>>>>> 
>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
>>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace machine 
>>>>> and saw
>>>>> no non-noise impact on performance. We would appreciate help with wider
>>>>> benchmarking on other platforms, if necessary.
>>>>> OK for mainline?
>>>>> 
>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>>> 
>>>>> gcc/
>>>>>    * config/aarch64/aarch64-tuning-flags.def: Remove
>>>>>    use_new_vector_costs as tuning option.
>>>>>    * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>>>>    Remove.
>>>>>    (aarch64_vector_costs::add_stmt_cost): Remove use of
>>>>>    aarch64_use_new_vector_costs_p and guard call to
>>>>>    vect_is_store_elt_extraction with count > 1.
>>>>>    (aarch64_vector_costs::finish_cost): Remove use of
>>>>>    aarch64_use_new_vector_costs_p.
>>>>>    * config/aarch64/tuning_models/cortexx925.h: Remove
>>>>>    AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>>>>    * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>>>>    * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>>>>    * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>>>>    * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>>>>    * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>>>>    * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>>>>    * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>>>>    * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>>>>    * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>>>>    * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>>>> 
>>>>> gcc/testsuite/
>>>>>    * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>>>>    * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>>>> ---
>>>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
>>>>> gcc/config/aarch64/aarch64.cc                 | 22 +++++--------------
>>>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>>>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>>>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>>>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>>>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>>>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>>>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>>>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>>>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>>>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>>>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>>>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>>>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>>>>> 15 files changed, 7 insertions(+), 32 deletions(-)
>>>>> 
>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>> index 5939602576b..ed345b13ed3 100644
>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
>>>>> CHEAP_SHIFT_EXTEND)
>>>>> 
>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>>>>> 
>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", 
>>>>> USE_NEW_VECTOR_COSTS)
>>>>> -
>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
>>>>> MATCHED_VECTOR_THROUGHPUT)
>>>>> 
>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
>>>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>>>> index 43238aefef2..03806671c97 100644
>>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, 
>>>>> bool costing_for_scalar)
>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>>>> }
>>>>> 
>>>>> -/* Return true if the current CPU should use the new costs defined
>>>>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
>>>>> -   costs applying to all CPUs instead.  */
>>>>> -static bool
>>>>> -aarch64_use_new_vector_costs_p ()
>>>>> -{
>>>>> -  return (aarch64_tune_params.extra_tuning_flags
>>>>> -       & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>>>> -}
>>>>> -
>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
>>>>> static const simd_vec_cost *
>>>>> aarch64_simd_vec_costs (tree vectype)
>>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>>> vect_cost_for_stmt kind,
>>>>> 
>>>>> /* Do one-time initialization based on the vinfo.  */
>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>>>> +  if (!m_analyzed_vinfo)
>>>>>   {
>>>>>     if (loop_vinfo)
>>>>>    analyze_loop_vinfo (loop_vinfo);
>>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>>> vect_cost_for_stmt kind,
>>>>> 
>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>    of just looking at KIND.  */
>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>> +  if (stmt_info)
>>>>>   {
>>>>>     /* If we scalarize a strided store, the vectorizer costs one
>>>>>     vec_to_scalar for each element.  However, we can store the first
>>>>>     element using an FP store without a separate extract step.  */
>>>>> -      if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>> +      if (vect_is_store_elt_extraction (kind, stmt_info) && count > 1)
>>>>>    count -= 1;
>>>>> 
>>>>>     stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>>> vect_cost_for_stmt kind,
>>>>> else
>>>>>   m_num_last_promote_demote = 0;
>>>>> 
>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>> +  if (stmt_info)
>>>>>   {
>>>>>     /* Account for any extra "embedded" costs that apply additively
>>>>>     to the base cost calculated above.  */
>>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const 
>>>>> vector_costs *uncast_scalar_costs)
>>>>> 
>>>>> auto *scalar_costs
>>>>>   = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>>>> -  if (loop_vinfo
>>>>> -      && m_vec_flags
>>>>> -      && aarch64_use_new_vector_costs_p ())
>>>>> +  if (loop_vinfo && m_vec_flags)
>>>>>   {
>>>>>     m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>>>                                         m_costs[vect_body]);
>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>> index eb9b89984b0..dafea96e924 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>> index 6a098497759..ac001927959 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
>>>>> 0, /* max_case_values.  */
>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>> index 9b1cbfc5bd2..7b534831340 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>> @@ -183,7 +183,6 @@ static const struct tune_params 
>>>>> generic_armv8_a_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>> index 48353a59939..562ef89c67b 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>> @@ -249,7 +249,6 @@ static const struct tune_params 
>>>>> generic_armv9_a_tunings =
>>>>> 0, /* max_case_values.  */
>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>> &generic_armv9a_prefetch_tune,
>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>> index c407b89a22f..fe4f7c10f73 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>> @@ -156,7 +156,6 @@ static const struct tune_params 
>>>>> neoverse512tvb_tunings =
>>>>> 0, /* max_case_values.  */
>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>> index 18199ac206c..56be77423cb 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>> index 4da85cfac0d..254ad5e27f8 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>> index dd9120eee48..c7241cf23d7 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>> @@ -227,7 +227,6 @@ static const struct tune_params neoversev1_tunings =
>>>>> 0, /* max_case_values.  */
>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>  | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>> index 1369de73991..96f55940649 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>>>>  | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),        /* tune_flags.  */
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>> index d8c82255378..f62ae67d355 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>> index 7f050501ede..0233baf5e34 100644
>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
>>>>> tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>  | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>  | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>  | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>> &generic_prefetch_tune,
>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>> index 762805ff54b..c334b7a6875 100644
>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>> @@ -15,4 +15,4 @@
>>>>>  so we vectorize the offset calculation.  This means that the
>>>>>  64-bit version needs two copies.  */
>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>> index f0ea58e38e2..94cc63049bc 100644
>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>> @@ -15,4 +15,4 @@
>>>>>  so we vectorize the offset calculation.  This means that the
>>>>>  64-bit version needs two copies.  */
>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>> 
>>>> 
>>>> --
>>>> Richard Biener <rguent...@suse.de>
>>>> SUSE Software Solutions Germany GmbH,
>>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 
>

smime.p7s
Description: S/MIME cryptographic signature

Re: [RFC][PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS

Reply via email to