> On 5 Dec 2024, at 20:07, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote:
>>> 
>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote:
>>> 
>>>> 
>>>> 
>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford <richard.sandif...@arm.com> 
>>>>> wrote:
>>>>> 
>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>> [...]
>>>>>> Looking at the diff of the vect dumps (below is a section of the diff 
>>>>>> for strided_store_2.c), it seemed odd that vec_to_scalar operations cost 
>>>>>> 0 now, instead of the previous cost of 2:
>>>>>> 
>>>>>> +strided_store_1.c:38:151: note:    === vectorizable_operation ===
>>>>>> +strided_store_1.c:38:151: note:    vect_model_simple_cost: inside_cost 
>>>>>> = 1, prologue_cost  = 0 .
>>>>>> +strided_store_1.c:38:151: note:   ==> examining statement: *_6 = _7;
>>>>>> +strided_store_1.c:38:151: note:   vect_is_simple_use: operand _3 + 
>>>>>> 1.0e+0, type of def:    internal
>>>>>> +strided_store_1.c:38:151: note:   Vectorizing an unaligned access.
>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128
>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
>>>>>> +strided_store_1.c:38:151: note:   vect_model_store_cost: inside_cost = 
>>>>>> 12, prologue_cost = 0 .
>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body
>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>> +<unknown> 1 times vector_load costs 1 in prologue
>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>> 
>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in multiple 
>>>>>> places in aarch64.cc, the location that causes this behavior is this one:
>>>>>> unsigned
>>>>>> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>>>>>                                   stmt_vec_info stmt_info, slp_tree,
>>>>>>                                   tree vectype, int misalign,
>>>>>>                                   vect_cost_model_location where)
>>>>>> {
>>>>>> [...]
>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>   of just looking at KIND.  */
>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>> +  if (stmt_info)
>>>>>>  {
>>>>>>    /* If we scalarize a strided store, the vectorizer costs one
>>>>>>       vec_to_scalar for each element.  However, we can store the first
>>>>>>       element using an FP store without a separate extract step.  */
>>>>>>    if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>      count -= 1;
>>>>>> 
>>>>>>    stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>>>                                                    stmt_info, stmt_cost);
>>>>>> 
>>>>>>    if (vectype && m_vec_flags)
>>>>>>      stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>>>>>>                                                      stmt_info, vectype,
>>>>>>                                                      where, stmt_cost);
>>>>>>  }
>>>>>> [...]
>>>>>> return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
>>>>>> }
>>>>>> 
>>>>>> Previously, for mtune=generic, this function returned a cost of 2 for a 
>>>>>> vec_to_scalar operation in the vect body. Now "if (stmt_info)" is 
>>>>>> entered and "if (vect_is_store_elt_extraction (kind, stmt_info))" 
>>>>>> evaluates to true, which sets the count to 0 and leads to a return value 
>>>>>> of 0.
>>>>> 
>>>>> At the time the code was written, a scalarised store would be costed
>>>>> using one vec_to_scalar call into the backend, with the count parameter
>>>>> set to the number of elements being stored.  The "count -= 1" was
>>>>> supposed to lop off the leading element extraction, since we can store
>>>>> lane 0 as a normal FP store.
>>>>> 
>>>>> The target-independent costing was later reworked so that it costs
>>>>> each operation individually:
>>>>> 
>>>>>            for (i = 0; i < nstores; i++)
>>>>>              {
>>>>>                if (costing_p)
>>>>>                  {
>>>>>                    /* Only need vector extracting when there are more
>>>>>                       than one stores.  */
>>>>>                    if (nstores > 1)
>>>>>                      inside_cost
>>>>>                        += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>>>>>                                             stmt_info, 0, vect_body);
>>>>>                    /* Take a single lane vector type store as scalar
>>>>>                       store to avoid ICE like 110776.  */
>>>>>                    if (VECTOR_TYPE_P (ltype)
>>>>>                        && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>>>                      n_adjacent_stores++;
>>>>>                    else
>>>>>                      inside_cost
>>>>>                        += record_stmt_cost (cost_vec, 1, scalar_store,
>>>>>                                             stmt_info, 0, vect_body);
>>>>>                    continue;
>>>>>                  }
>>>>> 
>>>>> Unfortunately, there's no easy way of telling whether a particular call
>>>>> is part of a group, and if so, which member of the group it is.
>>>>> 
>>>>> I suppose we could give up on the attempt to be (somewhat) accurate
>>>>> and just disable the optimisation.  Or we could restrict it to count > 1,
>>>>> since it might still be useful for gathers and scatters.
>>>> I tried restricting the calls to vect_is_store_elt_extraction to count > 1 
>>>> and it seems to resolve the issue of costing vec_to_scalar operations with 
>>>> 0 (see patch below).
>>>> What are your thoughts on this?
>>> 
>>> Why didn't you pursue instead moving the vec_to_scalar cost together
>>> with the n_adjacent_store handling?
>> When I continued working on this patch, we had already reached stage 3 and I 
>> was hesitant to introduce changes to the middle-end that were not previously 
>> covered by this patch. So I tried to see whether the issue could be resolved
>> by making a small change in the backend instead.
>> If you still advise using the n_adjacent_store approach, I’m happy to look
>> into it again.
> 
> If Richard's ok with adjusting vectorizable_store for GCC 15 (which it
> sounds like he is), then I agree that would be better.  Otherwise we'd
> be creating technical debt to clean up for GCC 16.  And it is a regression
> of sorts, so is stage 3 material from that POV.
> 
> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a
> "let's clean this up next stage 1" thing, since we needed to add tuning
> for a new CPU late during the cycle.  But of course, there were other
> priorities when stage 1 actually came around, so it never actually
> happened.  Thanks again for being the one to sort this out.)
Thanks for your feedback. Then I will try to make it work in vectorizable_store.
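For reference, the rough direction I have in mind (just an untested sketch based
on the vectorizable_store snippet quoted above; the n_elt_extracts counter is
made up for illustration) is to stop recording one vec_to_scalar per element
inside the loop and instead record them in a single call next to the
n_adjacent_stores handling, so that the backend again sees the size of the
group in the count parameter:

    unsigned int n_elt_extracts = 0;  /* Hypothetical counter for this sketch.  */
    for (i = 0; i < nstores; i++)
      {
        if (costing_p)
          {
            /* Defer the extraction costs: count them here and record
               them as one group after the loop.  */
            if (nstores > 1)
              n_elt_extracts++;
            /* Take a single lane vector type store as scalar
               store to avoid ICE like 110776.  */
            if (VECTOR_TYPE_P (ltype)
                && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
              n_adjacent_stores++;
            else
              inside_cost
                += record_stmt_cost (cost_vec, 1, scalar_store,
                                     stmt_info, 0, vect_body);
            continue;
          }
        ...
      }
    /* Next to the existing n_adjacent_stores costing:  */
    if (costing_p && n_elt_extracts)
      inside_cost
        += record_stmt_cost (cost_vec, n_elt_extracts, vec_to_scalar,
                             stmt_info, 0, vect_body);

That way aarch64_vector_costs::add_stmt_cost would see the whole group of
extractions in one call and could again drop the first one via its
"count -= 1" adjustment.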
Best,
Jennifer
> 
> Richard
> 
>> Thanks,
>> Jennifer
>>> 
>>>> Thanks,
>>>> Jennifer
>>>> 
>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>> default. To that end, the function aarch64_use_new_vector_costs_p and its
>>>> uses are removed. To prevent costing vec_to_scalar operations with 0, as
>>>> described in
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>>> the call to vect_is_store_elt_extraction in
>>>> aarch64_vector_costs::add_stmt_cost is guarded with count > 1.
>>>> 
>>>> Two tests were adjusted due to changes in codegen. In both cases, the
>>>> old code performed loop unrolling once, but the new code does not:
>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>> -moverride=tune=none):
>>>> f_int64_t_32:
>>>>       cbz     w3, .L92
>>>>       mov     x4, 0
>>>>       uxtw    x3, w3
>>>> +       cntd    x5
>>>> +       whilelo p7.d, xzr, x3
>>>> +       mov     z29.s, w5
>>>>       mov     z31.s, w2
>>>> -       whilelo p6.d, xzr, x3
>>>> -       mov     x2, x3
>>>> -       index   z30.s, #0, #1
>>>> -       uqdecd  x2
>>>> -       ptrue   p5.b, all
>>>> -       whilelo p7.d, xzr, x2
>>>> +       index   z30.d, #0, #1
>>>> +       ptrue   p6.b, all
>>>>       .p2align 3,,7
>>>> .L94:
>>>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
>>>> -       ld1d    z28.d, p6/z, [x0]
>>>> -       movprfx z29, z31
>>>> -       mul     z29.s, p5/m, z29.s, z30.s
>>>> -       incw    x4
>>>> -       uunpklo z0.d, z29.s
>>>> -       uunpkhi z29.d, z29.s
>>>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
>>>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
>>>> -       add     z25.d, z28.d, z25.d
>>>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
>>>> +       movprfx z28, z31
>>>> +       mul     z28.s, p6/m, z28.s, z30.s
>>>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>>>>       add     z26.d, z27.d, z26.d
>>>> -       st1d    z26.d, p7, [x0, #1, mul vl]
>>>> -       whilelo p7.d, x4, x2
>>>> -       st1d    z25.d, p6, [x0]
>>>> -       incw    z30.s
>>>> -       incb    x0, all, mul #2
>>>> -       whilelo p6.d, x4, x3
>>>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
>>>> +       add     z30.s, z30.s, z29.s
>>>> +       incd    x4
>>>> +       whilelo p7.d, x4, x3
>>>>       b.any   .L94
>>>> .L92:
>>>>       ret
>>>> 
>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic 
>>>> -moverride=tune=none):
>>>> f_int64_t_32:
>>>>       cbz     w3, .L84
>>>> -       addvl   x5, x1, #1
>>>>       mov     x4, 0
>>>>       uxtw    x3, w3
>>>> -       mov     z31.s, w2
>>>> +       cntd    x5
>>>>       whilelo p7.d, xzr, x3
>>>> -       mov     x2, x3
>>>> -       index   z30.s, #0, #1
>>>> -       uqdecd  x2
>>>> -       ptrue   p5.b, all
>>>> -       whilelo p6.d, xzr, x2
>>>> +       mov     z29.s, w5
>>>> +       mov     z31.s, w2
>>>> +       index   z30.d, #0, #1
>>>> +       ptrue   p6.b, all
>>>>       .p2align 3,,7
>>>> .L86:
>>>> -       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
>>>> -       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
>>>> -       movprfx z29, z30
>>>> -       mul     z29.s, p5/m, z29.s, z31.s
>>>> -       add     z28.d, z28.d, #1
>>>> -       uunpklo z26.d, z29.s
>>>> -       st1d    z28.d, p7, [x0, z26.d, lsl 3]
>>>> -       incw    x4
>>>> -       uunpkhi z29.d, z29.s
>>>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
>>>> +       movprfx z28, z30
>>>> +       mul     z28.s, p6/m, z28.s, z31.s
>>>>       add     z27.d, z27.d, #1
>>>> -       whilelo p6.d, x4, x2
>>>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
>>>> -       incw    z30.s
>>>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
>>>> +       incd    x4
>>>> +       add     z30.s, z30.s, z29.s
>>>>       whilelo p7.d, x4, x3
>>>>       b.any   .L86
>>>> .L84:
>>>>     ret
>>>> 
>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace machine and
>>>> saw no performance impact beyond noise. We would appreciate help with wider
>>>> benchmarking on other platforms, if necessary.
>>>> OK for mainline?
>>>> 
>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>> 
>>>> gcc/
>>>>     * config/aarch64/aarch64-tuning-flags.def: Remove
>>>>     use_new_vector_costs as tuning option.
>>>>     * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>>>     Remove.
>>>>     (aarch64_vector_costs::add_stmt_cost): Remove use of
>>>>     aarch64_use_new_vector_costs_p and guard call to
>>>>     vect_is_store_elt_extraction with count > 1.
>>>>     (aarch64_vector_costs::finish_cost): Remove use of
>>>>     aarch64_use_new_vector_costs_p.
>>>>     * config/aarch64/tuning_models/cortexx925.h: Remove
>>>>     AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>>>     * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>>>     * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>>>     * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>>>     * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>>>     * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>>>     * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>>>     * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>>>     * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>>>     * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>>>     * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>>> 
>>>> gcc/testsuite/
>>>>     * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>>>     * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>>> ---
>>>> gcc/config/aarch64/aarch64-tuning-flags.def   |  2 --
>>>> gcc/config/aarch64/aarch64.cc                 | 22 +++++--------------
>>>> gcc/config/aarch64/tuning_models/cortexx925.h |  1 -
>>>> .../aarch64/tuning_models/fujitsu_monaka.h    |  1 -
>>>> .../aarch64/tuning_models/generic_armv8_a.h   |  1 -
>>>> .../aarch64/tuning_models/generic_armv9_a.h   |  1 -
>>>> .../aarch64/tuning_models/neoverse512tvb.h    |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversen2.h |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversen3.h |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversev1.h |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversev2.h |  1 -
>>>> gcc/config/aarch64/tuning_models/neoversev3.h |  1 -
>>>> .../aarch64/tuning_models/neoversev3ae.h      |  1 -
>>>> .../gcc.target/aarch64/sve/strided_load_2.c   |  2 +-
>>>> .../gcc.target/aarch64/sve/strided_store_2.c  |  2 +-
>>>> 15 files changed, 7 insertions(+), 32 deletions(-)
>>>> 
>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> index 5939602576b..ed345b13ed3 100644
>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", 
>>>> CHEAP_SHIFT_EXTEND)
>>>> 
>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>>>> 
>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
>>>> -
>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
>>>> MATCHED_VECTOR_THROUGHPUT)
>>>> 
>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
>>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>>> index 43238aefef2..03806671c97 100644
>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, 
>>>> bool costing_for_scalar)
>>>>  return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>>> }
>>>> 
>>>> -/* Return true if the current CPU should use the new costs defined
>>>> -   in GCC 11.  This should be removed for GCC 12 and above, with the
>>>> -   costs applying to all CPUs instead.  */
>>>> -static bool
>>>> -aarch64_use_new_vector_costs_p ()
>>>> -{
>>>> -  return (aarch64_tune_params.extra_tuning_flags
>>>> -       & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>>> -}
>>>> -
>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
>>>> static const simd_vec_cost *
>>>> aarch64_simd_vec_costs (tree vectype)
>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>> vect_cost_for_stmt kind,
>>>> 
>>>>  /* Do one-time initialization based on the vinfo.  */
>>>>  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>>> -  if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>>> +  if (!m_analyzed_vinfo)
>>>>    {
>>>>      if (loop_vinfo)
>>>>     analyze_loop_vinfo (loop_vinfo);
>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>> vect_cost_for_stmt kind,
>>>> 
>>>>  /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>     of just looking at KIND.  */
>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>> +  if (stmt_info)
>>>>    {
>>>>      /* If we scalarize a strided store, the vectorizer costs one
>>>>      vec_to_scalar for each element.  However, we can store the first
>>>>      element using an FP store without a separate extract step.  */
>>>> -      if (vect_is_store_elt_extraction (kind, stmt_info))
>>>> +      if (vect_is_store_elt_extraction (kind, stmt_info) && count > 1)
>>>>     count -= 1;
>>>> 
>>>>      stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int count, 
>>>> vect_cost_for_stmt kind,
>>>>  else
>>>>    m_num_last_promote_demote = 0;
>>>> 
>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>> +  if (stmt_info)
>>>>    {
>>>>      /* Account for any extra "embedded" costs that apply additively
>>>>      to the base cost calculated above.  */
>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const 
>>>> vector_costs *uncast_scalar_costs)
>>>> 
>>>>  auto *scalar_costs
>>>>    = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>>> -  if (loop_vinfo
>>>> -      && m_vec_flags
>>>> -      && aarch64_use_new_vector_costs_p ())
>>>> +  if (loop_vinfo && m_vec_flags)
>>>>    {
>>>>      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>>                                          m_costs[vect_body]);
>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h 
>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>> index eb9b89984b0..dafea96e924 100644
>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings =
>>>>  tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>  &generic_prefetch_tune,
>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>> index 6a098497759..ac001927959 100644
>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
>>>>  0, /* max_case_values.  */
>>>>  tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>  &generic_prefetch_tune,
>>>>  AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h 
>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>> index 9b1cbfc5bd2..7b534831340 100644
>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>> @@ -183,7 +183,6 @@ static const struct tune_params 
>>>> generic_armv8_a_tunings =
>>>>  tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>  &generic_prefetch_tune,
>>>>  AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h 
>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>> index 48353a59939..562ef89c67b 100644
>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>> @@ -249,7 +249,6 @@ static const struct tune_params 
>>>> generic_armv9_a_tunings =
>>>>  0, /* max_case_values.  */
>>>>  tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>  &generic_armv9a_prefetch_tune,
>>>>  AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h 
>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>> index c407b89a22f..fe4f7c10f73 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings 
>>>> =
>>>>  0, /* max_case_values.  */
>>>>  tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>  (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>  &generic_prefetch_tune,
>>>>  AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>> index 18199ac206c..56be77423cb 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
>>>>  tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>  &generic_prefetch_tune,
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>> index 4da85cfac0d..254ad5e27f8 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
>>>>  tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>>>>  &generic_prefetch_tune,
>>>>  AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>> index dd9120eee48..c7241cf23d7 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>> @@ -227,7 +227,6 @@ static const struct tune_params neoversev1_tunings =
>>>>  0, /* max_case_values.  */
>>>>  tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>  (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>> index 1369de73991..96f55940649 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings =
>>>>  tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>>>   | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),        /* tune_flags.  */
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>> index d8c82255378..f62ae67d355 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
>>>>  tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>  &generic_prefetch_tune,
>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h 
>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>> index 7f050501ede..0233baf5e34 100644
>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
>>>>  tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
>>>>  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>> -   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),     /* tune_flags.  */
>>>>  &generic_prefetch_tune,
>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c 
>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>> index 762805ff54b..c334b7a6875 100644
>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>> @@ -15,4 +15,4 @@
>>>>   so we vectorize the offset calculation.  This means that the
>>>>   64-bit version needs two copies.  */
>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, 
>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, 
>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c 
>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>> index f0ea58e38e2..94cc63049bc 100644
>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>> @@ -15,4 +15,4 @@
>>>>   so we vectorize the offset calculation.  This means that the
>>>>   64-bit version needs two copies.  */
>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], 
>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], 
>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>> 
>>> 
>>> --
>>> Richard Biener <rguent...@suse.de>
>>> SUSE Software Solutions Germany GmbH,
>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

