With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on
pre-empted vCPUs"), the scheduler avoids preempted vCPUs when scheduling
tasks on wakeup. This leads to a wrong choice of CPU, which in turn leads
to larger wakeup latencies. Eventually, it leads to a performance
regression in latency-sensitive benchmarks like soltp, schbench etc.

On PowerPC, vcpu_is_preempted() only looks at yield_count. If the
yield_count is odd, the vCPU is assumed to be preempted. However,
yield_count is also incremented whenever the LPAR enters CEDE state, so
any CPU that has entered CEDE state is assumed to be preempted.

Even if a vCPU of a dedicated LPAR is preempted/donated, the LPAR should
have right of first use, since it is supposed to own the vCPU.

On a Power9 System with 32 cores
 # lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           16
NUMA node(s):        2
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127

 # perf stat -a -r 5 ./schbench
v5.4                               v5.4 + patch
Latency percentiles (usec)         Latency percentiles (usec)
        50.0000th: 45                      50.0000th: 39
        75.0000th: 62                      75.0000th: 53
        90.0000th: 71                      90.0000th: 67
        95.0000th: 77                      95.0000th: 76
        *99.0000th: 91                     *99.0000th: 89
        99.5000th: 707                     99.5000th: 93
        99.9000th: 6920                    99.9000th: 118
        min=0, max=10048                   min=0, max=211
Latency percentiles (usec)         Latency percentiles (usec)
        50.0000th: 45                      50.0000th: 34
        75.0000th: 61                      75.0000th: 45
        90.0000th: 72                      90.0000th: 53
        95.0000th: 79                      95.0000th: 56
        *99.0000th: 691                    *99.0000th: 61
        99.5000th: 3972                    99.5000th: 63
        99.9000th: 8368                    99.9000th: 78
        min=0, max=16606                   min=0, max=228
Latency percentiles (usec)         Latency percentiles (usec)
        50.0000th: 45                      50.0000th: 34
        75.0000th: 61                      75.0000th: 45
        90.0000th: 71                      90.0000th: 53
        95.0000th: 77                      95.0000th: 57
        *99.0000th: 106                    *99.0000th: 63
        99.5000th: 2364                    99.5000th: 68
        99.9000th: 7480                    99.9000th: 100
        min=0, max=10001                   min=0, max=134
Latency percentiles (usec)         Latency percentiles (usec)
        50.0000th: 45                      50.0000th: 34
        75.0000th: 62                      75.0000th: 46
        90.0000th: 72                      90.0000th: 53
        95.0000th: 78                      95.0000th: 56
        *99.0000th: 93                     *99.0000th: 61
        99.5000th: 108                     99.5000th: 64
        99.9000th: 6792                    99.9000th: 85
        min=0, max=17681                   min=0, max=121
Latency percentiles (usec)         Latency percentiles (usec)
        50.0000th: 46                      50.0000th: 33
        75.0000th: 62                      75.0000th: 44
        90.0000th: 73                      90.0000th: 51
        95.0000th: 79                      95.0000th: 54
        *99.0000th: 113                    *99.0000th: 61
        99.5000th: 2724                    99.5000th: 64
        99.9000th: 6184                    99.9000th: 82
        min=0, max=9887                    min=0, max=121

 Performance counter stats for 'system wide' (5 runs):

                    v5.4                     v5.4 + patch
context-switches    43,373 ( +-  0.40% )     44,597 ( +-  0.55% )
cpu-migrations       1,211 ( +-  5.04% )        220 ( +-  6.23% )
page-faults         15,983 ( +-  5.21% )     15,360 ( +-  3.38% )

Waiman Long suggested using static_keys.

Reported-by: Parth Shah <pa...@linux.ibm.com>
Reported-by: Ihor Pasichnyk <ihor.pasich...@ibm.com>
Cc: Parth Shah <pa...@linux.ibm.com>
Cc: Ihor Pasichnyk <ihor.pasich...@ibm.com>
Cc: Juri Lelli <juri.le...@redhat.com>
Cc: Phil Auld <pa...@redhat.com>
Cc: Waiman Long <long...@redhat.com>
Cc: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
Tested-by: Juri Lelli <juri.le...@redhat.com>
Acked-by: Waiman Long <long...@redhat.com>
Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
---
Changelog v1 (https://patchwork.ozlabs.org/patch/1204190/) ->v3:
Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
This was suggested by Waiman Long.
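
For readers following the logic, here is a minimal userspace sketch (not
part of the patch) of the check described above: an odd yield_count is
reported as "preempted" only when the shared-processor gate is on. All
model_* names are hypothetical stand-ins; in particular a plain bool
replaces the static key, and a byte array replaces the lppaca field.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4

/* Hypothetical stand-in for the per-vCPU lppaca; only yield_count is modelled. */
struct model_lppaca {
	uint8_t yield_count_be[4];	/* stored big-endian, as the hypervisor writes it */
};

static struct model_lppaca model_lppaca[MODEL_NR_CPUS];

/* Assumption: a plain bool stands in for the shared_processor static key. */
static bool model_shared_processor;

/* Decode a big-endian 32-bit value, mirroring what be32_to_cpu() achieves. */
static uint32_t model_be32_to_cpu(const uint8_t b[4])
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Test helper: store a yield_count for one modelled vCPU in big-endian order. */
static void model_set_yield_count(int cpu, uint32_t v)
{
	model_lppaca[cpu].yield_count_be[0] = (uint8_t)(v >> 24);
	model_lppaca[cpu].yield_count_be[1] = (uint8_t)(v >> 16);
	model_lppaca[cpu].yield_count_be[2] = (uint8_t)(v >> 8);
	model_lppaca[cpu].yield_count_be[3] = (uint8_t)v;
}

/*
 * Same shape as the patched vcpu_is_preempted(): when the gate is off
 * (dedicated LPAR), never report preemption; otherwise report the parity
 * of yield_count.
 */
static bool model_vcpu_is_preempted(int cpu)
{
	if (!model_shared_processor)
		return false;
	return !!(model_be32_to_cpu(model_lppaca[cpu].yield_count_be) & 1);
}
```

With the gate off, model_vcpu_is_preempted() returns false for every
vCPU regardless of yield_count, which is the behaviour this patch aims
for on dedicated LPARs. In the kernel the gate is a static key so that
the common path costs a patched branch rather than a memory load.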

 arch/powerpc/include/asm/spinlock.h | 5 +++--
 arch/powerpc/mm/numa.c              | 4 ++++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index e9a960e28f3c..de817c25deff 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -35,11 +35,12 @@
 #define LOCK_TOKEN	1
 #endif
 
-#ifdef CONFIG_PPC_PSERIES
+#ifdef CONFIG_PPC_SPLPAR
+DECLARE_STATIC_KEY_FALSE(shared_processor);
 #define vcpu_is_preempted vcpu_is_preempted
 static inline bool vcpu_is_preempted(int cpu)
 {
-	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+	if (!static_branch_unlikely(&shared_processor))
 		return false;
 	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
 }
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d21ddcc..ffb971f3a63c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1568,9 +1568,13 @@ int prrn_is_enabled(void)
 	return prrn_enabled;
 }
 
+DEFINE_STATIC_KEY_FALSE(shared_processor);
+EXPORT_SYMBOL_GPL(shared_processor);
+
 void __init shared_proc_topology_init(void)
 {
 	if (lppaca_shared_proc(get_lppaca())) {
+		static_branch_enable(&shared_processor);
 		bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
 			    nr_cpumask_bits);
 		numa_update_cpu_topology(false);
-- 
2.18.1