(Missed Cc'ing Peter in the original posting)

On Fri, Apr 02, 2021 at 11:07:54AM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com>
>
> On POWER10 systems, the L2 cache is at the SMT4 small core level. The
> following commits ensure that L2 cache gets correctly discovered and
> the Last-Level-Cache domain (LLC) is set to the SMT sched-domain.
>
>     790a166 powerpc/smp: Parse ibm,thread-groups with multiple properties
>     1fdc1d6 powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
>     fbd2b67 powerpc/smp: Rename init_thread_group_l1_cache_map() to make
>             it generic
>     538abe powerpc/smp: Add support detecting thread-groups sharing L2 cache
>     0be4763 powerpc/cacheinfo: Print correct cache-sibling map/list for L2
>             cache
>
> However, with the LLC now on the SMT sched-domain, we are seeing some
> regressions in the performance of applications that require
> single-threaded performance. The reason for this is as follows:
>
> Prior to the change (we call this P9-sched below), the sched-domain
> hierarchy was:
>
>     SMT (SMT4) --> CACHE (SMT8)[LLC] --> MC (Hemisphere) --> DIE
>
> where the CACHE sched-domain is defined to be the Last Level Cache (LLC).
>
> On the upstream kernel, with the aforementioned commits (P10-sched),
> the sched-domain hierarchy is:
>
>     SMT (SMT4)[LLC] --> MC (Hemisphere) --> DIE
>
> with the SMT sched-domain as the LLC.
>
> When the scheduler tries to wake up a task, it chooses between the
> waker-CPU and the wakee's previous-CPU. Suppose this choice is called
> the "target", then in the target's LLC domain, the scheduler
>
> a) tries to find an idle core in the LLC. This helps exploit the
>    SMT folding that the wakee task can benefit from. If an idle
>    core is found, the wakee is woken up on it.
>
> b) Failing to find an idle core, the scheduler tries to find an idle
>    CPU in the LLC. This helps minimise the wakeup latency for the
>    wakee since it gets to run on the CPU immediately.
>
> c) Failing this, it will wake it up on the target CPU.
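
For readers following along, the a)/b)/c) search order above can be
sketched as a small user-space model. This is only an illustration of
the description in the changelog, not the scheduler's actual code: the
function name pick_wakeup_cpu(), the idle[] array and the assumption
that an LLC domain is a contiguous range of CPU numbers are all
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define THREADS_PER_CORE 4	/* SMT4 small core, as in the changelog */

/*
 * Hypothetical user-space model, not kernel code.
 *
 * idle[cpu] is true when that CPU is idle. The LLC domain is assumed
 * to be the contiguous CPU range [llc_first, llc_first + llc_size),
 * made up of whole SMT4 cores.
 */
static int pick_wakeup_cpu(const bool *idle, int llc_first, int llc_size,
			   int target)
{
	int end = llc_first + llc_size;

	assert(llc_size > 0 && llc_size % THREADS_PER_CORE == 0);
	assert(target >= llc_first && target < end);

	/* a) Look for a core in the LLC whose threads are all idle. */
	for (int core = llc_first; core < end; core += THREADS_PER_CORE) {
		bool core_idle = true;

		for (int t = 0; t < THREADS_PER_CORE; t++) {
			if (!idle[core + t])
				core_idle = false;
		}
		if (core_idle)
			return core;
	}

	/* b) No idle core: settle for any idle CPU in the LLC. */
	for (int cpu = llc_first; cpu < end; cpu++) {
		if (idle[cpu])
			return cpu;
	}

	/* c) Nothing idle in the LLC: wake up on the target itself. */
	return target;
}
```

With a busy first core and an idle second core, calling this with
llc_size = 4 (LLC is one SMT4 core) returns the busy target, while
llc_size = 8 (LLC spans two SMT4 cores) returns the first thread of the
idle core. That is the difference the rest of the changelog describes.
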
>
> Thus, with P9-sched topology, since the CACHE domain comprises two
> SMT4 cores, there is a decent chance that we get an idle core, failing
> which there is a relatively higher probability of finding an idle CPU
> among the 8 threads in the domain.
>
> However, in P10-sched topology, since the SMT domain is the LLC and it
> contains only a single SMT4 core, the probability that we find that
> core to be idle is less. Furthermore, since there are only 4 CPUs to
> search for an idle CPU, there is lower probability that we can get an
> idle CPU to wake up the task on.
>
> Thus applications which require single threaded performance will end
> up getting woken up on a potentially busy core, even though there are
> idle cores in the system.
>
> To remedy this, this patch proposes that the LLC be moved to the MC
> level which is a group of cores in one half of the chip.
>
>     SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
>
> While there is no cache being shared at this level, this is still the
> level where some amount of cache-snooping takes place and it is
> relatively faster to access the data from the caches of the cores
> within this domain. With this change, we no longer see regressions on
> P10 for applications which require single threaded performance.
>
> The patch also improves the tail latencies on schbench and the
> usecs/op on "perf bench sched pipe"
>
> On a 10 core P10 system with 80 CPUs,
>
> schbench
> ============
> (https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/)
>
> Values: lower is better.
> 99th percentile is the tail latency.
>
>
>                  99th percentile
>                ~~~~~~~~~~~~~~~~~~
> No. messenger
> threads        5.12-rc4           5.12-rc4
>                P10-sched          MC-LLC
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 1                70 us              85 us
> 2                81 us             101 us
> 3                92 us             107 us
> 4                96 us             110 us
> 5               103 us             123 us
> 6              3412 us   ---->     122 us
> 7              1490 us             136 us
> 8              6200 us            3572 us
>
>
> Hackbench
> ============
> (perf bench sched pipe)
> Values: lower is better.
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> No. of
> parallel
> instances      5.12-rc4           5.12-rc4
>                P10-sched          MC-LLC
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 1              24.04 us/op        18.72 us/op
> 2              24.04 us/op        18.65 us/op
> 4              24.01 us/op        18.76 us/op
> 8              24.10 us/op        19.11 us/op
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/smp.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 5a4d59a..c75dbd4 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -976,6 +976,13 @@ static bool has_coregroup_support(void)
>  	return coregroup_enabled;
>  }
>  
> +static int powerpc_mc_flags(void)
> +{
> +	if(has_coregroup_support())
> +		return SD_SHARE_PKG_RESOURCES;
> +	return 0;
> +}
> +
>  static const struct cpumask *cpu_mc_mask(int cpu)
>  {
>  	return cpu_coregroup_mask(cpu);
> @@ -986,7 +993,7 @@ static const struct cpumask *cpu_mc_mask(int cpu)
>  	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
>  	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
> -	{ cpu_mc_mask, SD_INIT_NAME(MC) },
> +	{ cpu_mc_mask, powerpc_mc_flags, SD_INIT_NAME(MC) },
>  	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
>  	{ NULL, },
>  };
> -- 
> 1.9.4
>
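
As a rough back-of-the-envelope check of the probability argument in
the changelog, here is a worked estimate. It rests on an assumption the
patch itself does not make: that each CPU is idle independently with
the same probability p. Real workloads are correlated, so treat this as
an illustration only. Here c is the number of SMT4 cores in the LLC
domain.

```latex
\[
  P(\text{idle core in LLC}) = 1 - \left(1 - p^{4}\right)^{c},
  \qquad
  P(\text{idle CPU in LLC}) = 1 - (1 - p)^{4c}
\]

% Example with p = 1/2:
\[
  c = 1 \text{ (P10-sched)}:\quad
  P(\text{idle core}) = \tfrac{1}{16} \approx 0.06,
  \quad
  P(\text{idle CPU}) = \tfrac{15}{16} \approx 0.94
\]
\[
  c = 2 \text{ (P9-sched)}:\quad
  P(\text{idle core}) = 1 - \left(\tfrac{15}{16}\right)^{2}
                      = \tfrac{31}{256} \approx 0.12,
  \quad
  P(\text{idle CPU}) = \tfrac{255}{256} \approx 0.996
\]
```

Under this simplified model, both probabilities grow with c, which is
consistent with the motivation for moving the LLC up to the MC level.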