> On 5 Dec 2024, at 20:07, Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>> On 5 Dec 2024, at 11:44, Richard Biener <rguent...@suse.de> wrote:
>>>
>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote:
>>>
>>>>
>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>>>>
>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>> [...]
>>>>>> Looking at the diff of the vect dumps (below is a section of the diff
>>>>>> for strided_store_2.c), it seemed odd that vec_to_scalar operations cost
>>>>>> 0 now, instead of the previous cost of 2:
>>>>>>
>>>>>> +strided_store_1.c:38:151: note: === vectorizable_operation ===
>>>>>> +strided_store_1.c:38:151: note: vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
>>>>>> +strided_store_1.c:38:151: note: ==> examining statement: *_6 = _7;
>>>>>> +strided_store_1.c:38:151: note: vect_is_simple_use: operand _3 + 1.0e+0, type of def: internal
>>>>>> +strided_store_1.c:38:151: note: Vectorizing an unaligned access.
>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128
>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
>>>>>> +strided_store_1.c:38:151: note: vect_model_store_cost: inside_cost = 12, prologue_cost = 0 .
>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body
>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>> +<unknown> 1 times vector_load costs 1 in prologue
>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>> -_7 1 times vec_to_scalar costs 2 in body
>>>>>> +_7 1 times vec_to_scalar costs 0 in body
>>>>>> _7 1 times scalar_store costs 1 in body
>>>>>>
>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in multiple
>>>>>> places in aarch64.cc, the location that causes this behavior is this one:
>>>>>> unsigned
>>>>>> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>>>>>                                      stmt_vec_info stmt_info, slp_tree,
>>>>>>                                      tree vectype, int misalign,
>>>>>>                                      vect_cost_model_location where)
>>>>>> {
>>>>>>   [...]
>>>>>>   /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>      of just looking at KIND. */
>>>>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>> +  if (stmt_info)
>>>>>>     {
>>>>>>       /* If we scalarize a strided store, the vectorizer costs one
>>>>>>          vec_to_scalar for each element. However, we can store the first
>>>>>>          element using an FP store without a separate extract step. */
>>>>>>       if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>         count -= 1;
>>>>>>
>>>>>>       stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>>>                                                       stmt_info, stmt_cost);
>>>>>>
>>>>>>       if (vectype && m_vec_flags)
>>>>>>         stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>>>>>>                                                         stmt_info, vectype,
>>>>>>                                                         where, stmt_cost);
>>>>>>     }
>>>>>>   [...]
>>>>>>   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
>>>>>> }
>>>>>>
>>>>>> Previously, for mtune=generic, this function returned a cost of 2 for a
>>>>>> vec_to_scalar operation in the vect body. Now "if (stmt_info)" is
>>>>>> entered and "if (vect_is_store_elt_extraction (kind, stmt_info))"
>>>>>> evaluates to true, which sets the count to 0 and leads to a return value
>>>>>> of 0.
>>>>>
>>>>> At the time the code was written, a scalarised store would be costed
>>>>> using one vec_to_scalar call into the backend, with the count parameter
>>>>> set to the number of elements being stored. The "count -= 1" was
>>>>> supposed to lop off the leading element extraction, since we can store
>>>>> lane 0 as a normal FP store.
>>>>>
>>>>> The target-independent costing was later reworked so that it costs
>>>>> each operation individually:
>>>>>
>>>>>   for (i = 0; i < nstores; i++)
>>>>>     {
>>>>>       if (costing_p)
>>>>>         {
>>>>>           /* Only need vector extracting when there are more
>>>>>              than one stores. */
>>>>>           if (nstores > 1)
>>>>>             inside_cost
>>>>>               += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>>>>>                                    stmt_info, 0, vect_body);
>>>>>           /* Take a single lane vector type store as scalar
>>>>>              store to avoid ICE like 110776. */
>>>>>           if (VECTOR_TYPE_P (ltype)
>>>>>               && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>>>>>             n_adjacent_stores++;
>>>>>           else
>>>>>             inside_cost
>>>>>               += record_stmt_cost (cost_vec, 1, scalar_store,
>>>>>                                    stmt_info, 0, vect_body);
>>>>>           continue;
>>>>>         }
>>>>>
>>>>> Unfortunately, there's no easy way of telling whether a particular call
>>>>> is part of a group, and if so, which member of the group it is.
>>>>>
>>>>> I suppose we could give up on the attempt to be (somewhat) accurate
>>>>> and just disable the optimisation. Or we could restrict it to count > 1,
>>>>> since it might still be useful for gathers and scatters.
>>>> I tried restricting the calls to vect_is_store_elt_extraction to count > 1
>>>> and it seems to resolve the issue of costing vec_to_scalar operations with
>>>> 0 (see patch below).
>>>> What are your thoughts on this?
>>>
>>> Why didn't you pursue instead moving the vec_to_scalar cost together
>>> with the n_adjacent_store handling?
>> When I continued working on this patch, we had already reached stage 3 and I
>> was hesitant to introduce changes to the middle-end that were not previously
>> covered by this patch. So I tried to see whether the issue could be resolved
>> by a small change in the backend instead.
>> If you still advise using the n_adjacent_stores handling instead, I’m happy
>> to look into it again.
>
> If Richard's ok with adjusting vectorizable_store for GCC 15 (which it
> sounds like he is), then I agree that would be better. Otherwise we'd
> be creating technical debt to clean up for GCC 16. And it is a regression
> of sorts, so is stage 3 material from that POV.
>
> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a
> "let's clean this up next stage 1" thing, since we needed to add tuning
> for a new CPU late during the cycle. But of course, there were other
> priorities when stage 1 actually came around, so it never actually
> happened. Thanks again for being the one to sort this out.)

Thanks for your feedback. Then I will try to make it work in vectorizable_store.
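Something along the lines of the sketch below, perhaps (completely untested,
and the counter n_elt_extracts is only a placeholder name for illustration,
not something that exists in vectorizable_store today): count the element
extractions next to n_adjacent_stores inside the costing_p path and record
them in a single call after the loop, so that the backend sees the size of
the group in the count parameter again and can keep dropping the leading
extraction:

  unsigned int n_elt_extracts = 0;   /* alongside n_adjacent_stores */
  [...]
  for (i = 0; i < nstores; i++)
    {
      if (costing_p)
        {
          /* Count the extraction instead of costing it per element.  */
          if (nstores > 1)
            n_elt_extracts++;
          /* Take a single lane vector type store as scalar
             store to avoid ICE like 110776. */
          if (VECTOR_TYPE_P (ltype)
              && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
            n_adjacent_stores++;
          else
            inside_cost
              += record_stmt_cost (cost_vec, 1, scalar_store,
                                   stmt_info, 0, vect_body);
          continue;
        }
      [...]
    }
  /* Cost all extractions of the group in one call, so the target sees
     COUNT > 1 and can subtract the leading extraction as before.  */
  if (costing_p && n_elt_extracts)
    inside_cost += record_stmt_cost (cost_vec, n_elt_extracts, vec_to_scalar,
                                     stmt_info, 0, vect_body);

The idea is just to get the per-group information back into a single
record_stmt_cost call, similar to what n_adjacent_stores already collects
across the loop.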
Best, Jennifer > > Richard > >> Thanks, >> Jennifer >>> >>>> Thanks, >>>> Jennifer >>>> >>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and >>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the >>>> default. To that end, the function aarch64_use_new_vector_costs_p and its >>>> uses >>>> were removed. To prevent costing vec_to_scalar operations with 0, as >>>> described in >>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, >>>> we guarded the call to vect_is_store_elt_extraction in >>>> aarch64_vector_costs::add_stmt_cost by count > 1. >>>> >>>> Two tests were adjusted due to changes in codegen. In both cases, the >>>> old code performed loop unrolling once, but the new code does not: >>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>> -moverride=tune=none): >>>> f_int64_t_32: >>>> cbz w3, .L92 >>>> mov x4, 0 >>>> uxtw x3, w3 >>>> + cntd x5 >>>> + whilelo p7.d, xzr, x3 >>>> + mov z29.s, w5 >>>> mov z31.s, w2 >>>> - whilelo p6.d, xzr, x3 >>>> - mov x2, x3 >>>> - index z30.s, #0, #1 >>>> - uqdecd x2 >>>> - ptrue p5.b, all >>>> - whilelo p7.d, xzr, x2 >>>> + index z30.d, #0, #1 >>>> + ptrue p6.b, all >>>> .p2align 3,,7 >>>> .L94: >>>> - ld1d z27.d, p7/z, [x0, #1, mul vl] >>>> - ld1d z28.d, p6/z, [x0] >>>> - movprfx z29, z31 >>>> - mul z29.s, p5/m, z29.s, z30.s >>>> - incw x4 >>>> - uunpklo z0.d, z29.s >>>> - uunpkhi z29.d, z29.s >>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] >>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] >>>> - add z25.d, z28.d, z25.d >>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3] >>>> + movprfx z28, z31 >>>> + mul z28.s, p6/m, z28.s, z30.s >>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] >>>> add z26.d, z27.d, z26.d >>>> - st1d z26.d, p7, [x0, #1, mul vl] >>>> - whilelo p7.d, x4, x2 >>>> - st1d z25.d, p6, [x0] >>>> - incw z30.s >>>> - incb x0, all, mul #2 >>>> - whilelo p6.d, x4, x3 >>>> + st1d z26.d, p7, [x0, x4, lsl 3] >>>> + add z30.s, z30.s, z29.s >>>> + incd x4 >>>> + whilelo p7.d, x4, x3 >>>> b.any .L94 >>>> .L92: >>>> ret >>>> >>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>> -moverride=tune=none): >>>> f_int64_t_32: >>>> cbz w3, .L84 >>>> - addvl x5, x1, #1 >>>> mov x4, 0 >>>> uxtw x3, w3 >>>> - mov z31.s, w2 >>>> + cntd x5 >>>> whilelo p7.d, xzr, x3 >>>> - mov x2, x3 >>>> - index z30.s, #0, #1 >>>> - uqdecd x2 >>>> - ptrue p5.b, all >>>> - whilelo p6.d, xzr, x2 >>>> + mov z29.s, w5 >>>> + mov z31.s, w2 >>>> + index z30.d, #0, #1 >>>> + ptrue p6.b, all >>>> .p2align 3,,7 >>>> .L86: >>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3] >>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3] >>>> - movprfx z29, z30 >>>> - mul z29.s, p5/m, z29.s, z31.s >>>> - add z28.d, z28.d, #1 >>>> - uunpklo z26.d, z29.s >>>> - st1d z28.d, p7, [x0, z26.d, lsl 3] >>>> - incw x4 >>>> - uunpkhi z29.d, z29.s >>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3] >>>> + movprfx z28, z30 >>>> + mul z28.s, p6/m, z28.s, z31.s >>>> add z27.d, z27.d, #1 >>>> - whilelo p6.d, x4, x2 >>>> - st1d z27.d, p7, [x0, z29.d, lsl 3] >>>> - incw z30.s >>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3] >>>> + incd x4 >>>> + add z30.s, z30.s, z29.s >>>> whilelo p7.d, x4, x3 >>>> b.any .L86 >>>> .L84: >>>> ret >>>> >>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no >>>> regression. 
We also ran SPEC2017 with -mcpu=generic on a Grace machine and >>>> saw >>>> no non-noise impact on performance. We would appreciate help with wider >>>> benchmarking on other platforms, if necessary. >>>> OK for mainline? >>>> >>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> >>>> >>>> gcc/ >>>> * config/aarch64/aarch64-tuning-flags.def: Remove >>>> use_new_vector_costs as tuning option. >>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): >>>> Remove. >>>> (aarch64_vector_costs::add_stmt_cost): Remove use of >>>> aarch64_use_new_vector_costs_p and guard call to >>>> vect_is_store_elt_extraction with count > 1. >>>> (aarch64_vector_costs::finish_cost): Remove use of >>>> aarch64_use_new_vector_costs_p. >>>> * config/aarch64/tuning_models/cortexx925.h: Remove >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. >>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. >>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. >>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. >>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. >>>> * config/aarch64/tuning_models/neoversen2.h: Likewise. >>>> * config/aarch64/tuning_models/neoversen3.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev1.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev2.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev3.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. >>>> >>>> gcc/testsuite/ >>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. >>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. >>>> --- >>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 -- >>>> gcc/config/aarch64/aarch64.cc | 22 +++++-------------- >>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - >>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - >>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 - >>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 - >>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - >>>> .../aarch64/tuning_models/neoversev3ae.h | 1 - >>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- >>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- >>>> 15 files changed, 7 insertions(+), 32 deletions(-) >>>> >>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def >>>> b/gcc/config/aarch64/aarch64-tuning-flags.def >>>> index 5939602576b..ed345b13ed3 100644 >>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def >>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def >>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", >>>> CHEAP_SHIFT_EXTEND) >>>> >>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS) >>>> >>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS) >>>> - >>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", >>>> MATCHED_VECTOR_THROUGHPUT) >>>> >>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA) >>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc >>>> index 43238aefef2..03806671c97 100644 >>>> --- a/gcc/config/aarch64/aarch64.cc >>>> +++ b/gcc/config/aarch64/aarch64.cc >>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, >>>> bool 
costing_for_scalar) >>>> return new aarch64_vector_costs (vinfo, costing_for_scalar); >>>> } >>>> >>>> -/* Return true if the current CPU should use the new costs defined >>>> - in GCC 11. This should be removed for GCC 12 and above, with the >>>> - costs applying to all CPUs instead. */ >>>> -static bool >>>> -aarch64_use_new_vector_costs_p () >>>> -{ >>>> - return (aarch64_tune_params.extra_tuning_flags >>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); >>>> -} >>>> - >>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ >>>> static const simd_vec_cost * >>>> aarch64_simd_vec_costs (tree vectype) >>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>> vect_cost_for_stmt kind, >>>> >>>> /* Do one-time initialization based on the vinfo. */ >>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); >>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) >>>> + if (!m_analyzed_vinfo) >>>> { >>>> if (loop_vinfo) >>>> analyze_loop_vinfo (loop_vinfo); >>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>> vect_cost_for_stmt kind, >>>> >>>> /* Try to get a more accurate cost by looking at STMT_INFO instead >>>> of just looking at KIND. */ >>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>> + if (stmt_info) >>>> { >>>> /* If we scalarize a strided store, the vectorizer costs one >>>> vec_to_scalar for each element. However, we can store the first >>>> element using an FP store without a separate extract step. */ >>>> - if (vect_is_store_elt_extraction (kind, stmt_info)) >>>> + if (vect_is_store_elt_extraction (kind, stmt_info) && count > 1) >>>> count -= 1; >>>> >>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind, >>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>> vect_cost_for_stmt kind, >>>> else >>>> m_num_last_promote_demote = 0; >>>> >>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>> + if (stmt_info) >>>> { >>>> /* Account for any extra "embedded" costs that apply additively >>>> to the base cost calculated above. */ >>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const >>>> vector_costs *uncast_scalar_costs) >>>> >>>> auto *scalar_costs >>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); >>>> - if (loop_vinfo >>>> - && m_vec_flags >>>> - && aarch64_use_new_vector_costs_p ()) >>>> + if (loop_vinfo && m_vec_flags) >>>> { >>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >>>> m_costs[vect_body]); >>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h >>>> b/gcc/config/aarch64/tuning_models/cortexx925.h >>>> index eb9b89984b0..dafea96e924 100644 >>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h >>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. 
*/ >>>> &generic_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> index 6a098497759..ac001927959 100644 >>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> index 9b1cbfc5bd2..7b534831340 100644 >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> @@ -183,7 +183,6 @@ static const struct tune_params >>>> generic_armv8_a_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> index 48353a59939..562ef89c67b 100644 >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> @@ -249,7 +249,6 @@ static const struct tune_params >>>> generic_armv9_a_tunings = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> index c407b89a22f..fe4f7c10f73 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings >>>> = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h >>>> b/gcc/config/aarch64/tuning_models/neoversen2.h >>>> index 18199ac206c..56be77423cb 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ >>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h >>>> b/gcc/config/aarch64/tuning_models/neoversen3.h >>>> index 4da85cfac0d..254ad5e27f8 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h >>>> b/gcc/config/aarch64/tuning_models/neoversev1.h >>>> index dd9120eee48..c7241cf23d7 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h >>>> @@ -227,7 +227,6 @@ static const struct tune_params neoversev1_tunings = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h >>>> b/gcc/config/aarch64/tuning_models/neoversev2.h >>>> index 1369de73991..96f55940649 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h >>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW >>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h >>>> b/gcc/config/aarch64/tuning_models/neoversev3.h >>>> index d8c82255378..f62ae67d355 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> index 7f050501ede..0233baf5e34 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ >>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> index 762805ff54b..c334b7a6875 100644 >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> @@ -15,4 +15,4 @@ >>>> so we vectorize the offset calculation. This means that the >>>> 64-bit version needs two copies. */ >>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> index f0ea58e38e2..94cc63049bc 100644 >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> @@ -15,4 +15,4 @@ >>>> so we vectorize the offset calculation. This means that the >>>> 64-bit version needs two copies. */ >>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>> >>> >>> -- >>> Richard Biener <rguent...@suse.de> >>> SUSE Software Solutions Germany GmbH, >>> Frankenstrasse 146, 90461 Nuernberg, Germany; >>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)