Hi Richard,

> On 1 Oct 2024, at 13:35, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Hi all,
>> I'd like to use a value of 64 bytes for the L1 cache size for Armv9-A
>> generic tuning.
>> As described in g:9a99559a478111f7fbeec29bd78344df7651c707 this value is used
>> to set the std::hardware_destructive_interference_size value which we want to
>> be not overly large when running concurrent applications on large core-count
>> systems.
>> 
>> The generic value for Armv8-A systems and the port baseline is 256 bytes
>> because that's what the A64FX CPU has, as set de-facto in
>> aarch64_override_options_internal.
>> 
>> But for Armv9-A CPUs as far as I know there isn't anything larger
>> than 64 bytes, so we should be able to use the smaller value here and reduce
>> the size of concurrent structs that use
>> std::hardware_destructive_interference_size to pad their fields.
>> 
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> 
>> WDYT?
> 
> I suppose doing this for a form of generic tuning goes somewhat against:
> 
>  /* Set up parameters to be used in prefetching algorithm.  Do not
>     override the defaults unless we are tuning for a core we have
>     researched values for.  */
> 

Yeah, I think the intent of that comment is to cover the heuristics that guide SW prefetch emission.
I think it was added before the introduction of std::hardware_destructive_interference_size and its dependence on the L1 cache line size.
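For reference, here is a minimal sketch of how user code typically consumes that constant (illustrative only, not taken from any particular codebase): aligning independently-written fields to it pads them onto separate cache lines, so the amount of padding tracks whatever line size the compiler bakes into the value.

#include <atomic>
#include <new>  // std::hardware_destructive_interference_size

/* Two counters written by different threads.  Aligning each one to the
   interference size keeps them on separate cache lines and avoids false
   sharing; a 256-byte value inflates this struct far more than a
   64-byte one would.  */
struct counters
{
  alignas (std::hardware_destructive_interference_size)
    std::atomic<unsigned long> produced;
  alignas (std::hardware_destructive_interference_size)
    std::atomic<unsigned long> consumed;
};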


> But I agree it doesn't make conceptual sense to constrain a known-to-be
> Armv9-A core based on values that are only needed for Armv8-A cores.
> So no objection from me FWIW.
> 

Thanks, I’ll commit this version, but we may want to refactor things a bit in that area in the future.
It’s also something to consider when adding support for new Neoverse cores.


> I think we would need to do something else if there are ever Armv9-A
> cores with different L1 cache line sizes though.  E.g. if a new Armv9-A
> core has a 128-byte cache line, we would probably want to set the range
> to [64, 128] rather than the patch's [64, 64], and rather than the
> current [64, 256].

From what I can tell, using a value smaller than the real hardware line size would be catastrophically bad because it would introduce false sharing. Using a value larger than the hardware's is not as bad, just suboptimal.
So the compiler would have to pick the largest of the possible values IMO.
Maybe there’s some clever scheme we could invent in aarch64_override_options_internal to go through all the CPUs implementing the specified architecture or above and take the maximum of their line sizes automatically.
These values are known to the compiler statically, after all.
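Purely as a hypothetical illustration of that idea (the table and field names below, cpu_entry and known_cpus, are made up and not the actual GCC tuning structures):

#include <algorithm>

/* Hypothetical stand-ins for the per-CPU tuning data; the real
   structures in the aarch64 backend look different.  */
struct cpu_entry { int arch_level; int l1_cache_line_size; };

static const cpu_entry known_cpus[] = {
  { 9, 64 }, { 9, 64 }, { 8, 256 } /* e.g. an Armv8-A core like A64FX.  */
};

/* Pick the largest known L1 line size among cores implementing at
   least the selected architecture level, defaulting to 64 bytes.  */
static int
max_l1_line_size (int selected_arch_level)
{
  int max_line = 64;
  for (const cpu_entry &cpu : known_cpus)
    if (cpu.arch_level >= selected_arch_level && cpu.l1_cache_line_size > 0)
      max_line = std::max (max_line, cpu.l1_cache_line_size);
  return max_line;
}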
Thanks,
Kyrill


> 
> Thanks,
> Richard
> 
> 
>> Thanks,
>> Kyrill
>> 
>> 
>>        * config/aarch64/tuning_models/generic_armv9_a.h
>>        (generic_armv9a_prefetch_tune): Define.
>>        (generic_armv9_a_tunings): Use the above.
>> 
>> From 93aa4ec4d972dfff02ccd6751af160ed243aa750 Mon Sep 17 00:00:00 2001
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Date: Fri, 20 Sep 2024 05:11:39 -0700
>> Subject: [PATCH] aarch64: Set Armv9-A generic L1 cache line size to 64 bytes
>> 
>> I'd like to use a value of 64 bytes for the L1 cache size for Armv9-A
>> generic tuning.
>> As described in g:9a99559a478111f7fbeec29bd78344df7651c707 this value is used
>> to set the std::hardware_destructive_interference_size value which we want to
>> be not overly large when running concurrent applications on large core-count
>> systems.
>> 
>> The generic value for Armv8-A systems and the port baseline is 256 bytes
>> because that's what the A64FX CPU has, as set de-facto in
>> aarch64_override_options_internal.
>> 
>> But for Armv9-A CPUs as far as I know there isn't anything larger
>> than 64 bytes, so we should be able to use the smaller value here and reduce
>> the size of concurrent structs that use
>> std::hardware_destructive_interference_size to pad their fields.
>> 
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> 
>>      * config/aarch64/tuning_models/generic_armv9_a.h
>>      (generic_armv9a_prefetch_tune): Define.
>>      (generic_armv9_a_tunings): Use the above.
>> ---
>> gcc/config/aarch64/tuning_models/generic_armv9_a.h | 14 +++++++++++++-
>> 1 file changed, 13 insertions(+), 1 deletion(-)
>> 
>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>> index 999985ed40f..76b3e4c9cf7 100644
>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>> @@ -207,6 +207,18 @@ static const struct cpu_vector_cost generic_armv9_a_vector_cost =
>>   &generic_armv9_a_vec_issue_info /* issue_info  */
>> };
>> 
>> +/* Generic prefetch settings (which disable prefetch).  */
>> +static const cpu_prefetch_tune generic_armv9a_prefetch_tune =
>> +{
>> +  0,                 /* num_slots  */
>> +  -1,                        /* l1_cache_size  */
>> +  64,                        /* l1_cache_line_size  */
>> +  -1,                        /* l2_cache_size  */
>> +  true,                      /* prefetch_dynamic_strides */
>> +  -1,                        /* minimum_stride */
>> +  -1                 /* default_opt_level  */
>> +};
>> +
>> static const struct tune_params generic_armv9_a_tunings =
>> {
>>   &cortexa76_extra_costs,
>> @@ -239,7 +251,7 @@ static const struct tune_params generic_armv9_a_tunings =
>>   (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>    | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>    | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
>> -  &generic_prefetch_tune,
>> +  &generic_armv9a_prefetch_tune,
>>   AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>>   AARCH64_LDP_STP_POLICY_ALWAYS         /* stp_policy_model.  */
>> };

