From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com>

On POWER10 systems, the L2 cache is at the SMT4 small core level. The
following commits ensure that the L2 cache gets correctly discovered and
the Last-Level-Cache domain (LLC) is set to the SMT sched-domain.

  790a166 powerpc/smp: Parse ibm,thread-groups with multiple properties
  1fdc1d6 powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
  fbd2b67 powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic
  538abe powerpc/smp: Add support detecting thread-groups sharing L2 cache
  0be4763 powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

However, with the LLC now on the SMT sched-domain, we are seeing some
regressions in the performance of applications that require
single-threaded performance. The reason for this is as follows:

Prior to the change (we call this P9-sched below), the sched-domain
hierarchy was:

  SMT (SMT4) --> CACHE (SMT8)[LLC] --> MC (Hemisphere) --> DIE

where the CACHE sched-domain is defined to be the Last Level Cache (LLC).

On the upstream kernel, with the aforementioned commits (P10-sched),
the sched-domain hierarchy is:

  SMT (SMT4)[LLC] --> MC (Hemisphere) --> DIE

with the SMT sched-domain as the LLC.

When the scheduler tries to wake up a task, it chooses between the
waker-CPU and the wakee's previous-CPU. Suppose this choice is called
the "target", then in the target's LLC domain, the scheduler

a) tries to find an idle core in the LLC. This helps exploit the SMT
   folding that the wakee task can benefit from. If an idle core is
   found, the wakee is woken up on it.

b) Failing to find an idle core, the scheduler tries to find an idle
   CPU in the LLC. This helps minimise the wakeup latency for the
   wakee since it gets to run on the CPU immediately.

c) Failing this, it will wake it up on the target CPU.

Thus, with P9-sched topology, since the CACHE domain comprises two
SMT4 cores, there is a decent chance that we get an idle core, failing
which there is a relatively higher probability of finding an idle CPU
among the 8 threads in the domain.

However, in P10-sched topology, since the SMT domain is the LLC and it
contains only a single SMT4 core, the probability that we find that
core to be idle is less.
Furthermore, since there are only 4 CPUs to search for an idle CPU,
there is a lower probability that we can get an idle CPU to wake up the
task on.

Thus applications which require single-threaded performance will end up
getting woken up on a potentially busy core, even though there are idle
cores in the system.

To remedy this, this patch proposes that the LLC be moved to the MC
level, which is a group of cores in one half of the chip.

  SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE

While there is no cache being shared at this level, this is still the
level where some amount of cache-snooping takes place and it is
relatively faster to access the data from the caches of the cores
within this domain. With this change, we no longer see regressions on
P10 for applications which require single-threaded performance.

The patch also improves the tail latencies on schbench and the
usecs/op on "perf bench sched pipe".

On a 10 core P10 system with 80 CPUs,

schbench
============
(https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/)

Values : Lower the better.
99th percentile is the tail latency.

99th percentile
~~~~~~~~~~~~~~~~~~
No. messenger
threads          5.12-rc4         5.12-rc4
                 P10-sched        MC-LLC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1                  70 us            85 us
2                  81 us           101 us
3                  92 us           107 us
4                  96 us           110 us
5                 103 us           123 us
6                3412 us  ---->    122 us
7                1490 us           136 us
8                6200 us          3572 us

Hackbench
============
(perf bench sched pipe)
values: lower the better
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No. of
parallel
instances        5.12-rc4         5.12-rc4
                 P10-sched        MC-LLC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1                24.04 us/op      18.72 us/op
2                24.04 us/op      18.65 us/op
4                24.01 us/op      18.76 us/op
8                24.10 us/op      19.11 us/op
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5a4d59a..c75dbd4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -976,6 +976,13 @@ static bool has_coregroup_support(void)
 	return coregroup_enabled;
 }
 
+static int powerpc_mc_flags(void)
+{
+	if (has_coregroup_support())
+		return SD_SHARE_PKG_RESOURCES;
+	return 0;
+}
+
 static const struct cpumask *cpu_mc_mask(int cpu)
 {
 	return cpu_coregroup_mask(cpu);
@@ -986,7 +993,7 @@ static const struct cpumask *cpu_mc_mask(int cpu)
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
 	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
-	{ cpu_mc_mask, SD_INIT_NAME(MC) },
+	{ cpu_mc_mask, powerpc_mc_flags, SD_INIT_NAME(MC) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
-- 
1.9.4
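
For readers who want to see why the LLC span matters, the wakeup search
described in steps a) to c) of the changelog can be sketched as a small
userspace model. This is only an illustration, loosely following the
idea behind select_idle_sibling(); it is not the kernel's code. The
names core_is_idle() and select_wake_cpu(), the busy[] array, and the
assumption that an LLC domain is a contiguous, aligned range of CPU
numbers are all hypothetical simplifications.

```c
#include <assert.h>
#include <stdbool.h>

#define THREADS_PER_CORE 4	/* SMT4 small core, as in the changelog */

/* Hypothetical model: busy[cpu] is true when that CPU is running a task. */
static bool core_is_idle(const bool *busy, int first_cpu)
{
	for (int t = 0; t < THREADS_PER_CORE; t++)
		if (busy[first_cpu + t])
			return false;
	return true;
}

/*
 * Pick a wakeup CPU inside the LLC domain that contains 'target'.
 * llc_span is the number of CPUs in that domain: 4 for P10-sched,
 * 8 for P9-sched, and a larger (assumed) value for MC-LLC.
 * Assumption: domains are contiguous and aligned to llc_span.
 */
static int select_wake_cpu(const bool *busy, int target, int llc_span)
{
	assert(llc_span > 0 && llc_span % THREADS_PER_CORE == 0);

	int start = target - (target % llc_span);
	int end = start + llc_span;

	/* a) Prefer a fully idle core in the LLC. */
	for (int cpu = start; cpu < end; cpu += THREADS_PER_CORE)
		if (core_is_idle(busy, cpu))
			return cpu;

	/* b) Otherwise take any idle CPU in the LLC. */
	for (int cpu = start; cpu < end; cpu++)
		if (!busy[cpu])
			return cpu;

	/* c) Otherwise fall back to the target CPU itself. */
	return target;
}
```

With CPUs 0-3 busy and everything else idle, a span of 4 leaves the
wakee on the busy target CPU 0, while a span of 8 or more finds the
idle core starting at CPU 4. That is the effect the changelog
attributes to moving the LLC from the SMT level up to the MC level.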