On Thu, Dec 05, 2019 at 02:02:17PM +0530, Srikar Dronamraju wrote:
> With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
> vCPUs"), the scheduler avoids scheduling tasks on preempted vCPUs at
> wakeup. This can lead to a wrong choice of CPU, which in turn leads to
> larger wakeup latencies and, eventually, to performance regressions in
> latency-sensitive benchmarks like soltp, schbench etc.
> 
> On PowerPC, vcpu_is_preempted() only looks at yield_count. If the
> yield_count is odd, the vCPU is assumed to be preempted. However,
> yield_count is incremented whenever the LPAR enters CEDE state, so any
> CPU that has entered CEDE state is assumed to be preempted.
> 
> Even if a vCPU of a dedicated LPAR is preempted/donated, it should have
> the right of first use, since the dedicated LPAR is supposed to own the
> vCPU.
> 
> On a Power9 System with 32 cores
>  # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              128
> On-line CPU(s) list: 0-127
> Thread(s) per core:  8
> Core(s) per socket:  1
> Socket(s):           16
> NUMA node(s):        2
> Model:               2.2 (pvr 004e 0202)
> Model name:          POWER9 (architected), altivec supported
> Hypervisor vendor:   pHyp
> Virtualization type: para
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            512K
> L3 cache:            10240K
> NUMA node0 CPU(s):   0-63
> NUMA node1 CPU(s):   64-127
> 
>   # perf stat -a -r 5 ./schbench
> v5.4                                     v5.4 + patch
> Latency percentiles (usec)               Latency percentiles (usec)
>       50.0000th: 45                           50.0000th: 39
>       75.0000th: 62                           75.0000th: 53
>       90.0000th: 71                           90.0000th: 67
>       95.0000th: 77                           95.0000th: 76
>       *99.0000th: 91                          *99.0000th: 89
>       99.5000th: 707                          99.5000th: 93
>       99.9000th: 6920                         99.9000th: 118
>       min=0, max=10048                        min=0, max=211
> Latency percentiles (usec)               Latency percentiles (usec)
>       50.0000th: 45                           50.0000th: 34
>       75.0000th: 61                           75.0000th: 45
>       90.0000th: 72                           90.0000th: 53
>       95.0000th: 79                           95.0000th: 56
>       *99.0000th: 691                         *99.0000th: 61
>       99.5000th: 3972                         99.5000th: 63
>       99.9000th: 8368                         99.9000th: 78
>       min=0, max=16606                        min=0, max=228
> Latency percentiles (usec)               Latency percentiles (usec)
>       50.0000th: 45                           50.0000th: 34
>       75.0000th: 61                           75.0000th: 45
>       90.0000th: 71                           90.0000th: 53
>       95.0000th: 77                           95.0000th: 57
>       *99.0000th: 106                         *99.0000th: 63
>       99.5000th: 2364                         99.5000th: 68
>       99.9000th: 7480                         99.9000th: 100
>       min=0, max=10001                        min=0, max=134
> Latency percentiles (usec)               Latency percentiles (usec)
>       50.0000th: 45                           50.0000th: 34
>       75.0000th: 62                           75.0000th: 46
>       90.0000th: 72                           90.0000th: 53
>       95.0000th: 78                           95.0000th: 56
>       *99.0000th: 93                          *99.0000th: 61
>       99.5000th: 108                          99.5000th: 64
>       99.9000th: 6792                         99.9000th: 85
>       min=0, max=17681                        min=0, max=121
> Latency percentiles (usec)               Latency percentiles (usec)
>       50.0000th: 46                           50.0000th: 33
>       75.0000th: 62                           75.0000th: 44
>       90.0000th: 73                           90.0000th: 51
>       95.0000th: 79                           95.0000th: 54
>       *99.0000th: 113                         *99.0000th: 61
>       99.5000th: 2724                         99.5000th: 64
>       99.9000th: 6184                         99.9000th: 82
>       min=0, max=9887                         min=0, max=121
> 
>  Performance counter stats for 'system wide' (5 runs):
> 
> context-switches    43,373  ( +-  0.40% )   44,597 ( +-  0.55% )
> cpu-migrations       1,211  ( +-  5.04% )      220 ( +-  6.23% )
> page-faults         15,983  ( +-  5.21% )   15,360 ( +-  3.38% )
> 
> Waiman Long suggested using static_keys.
> 
> Reported-by: Parth Shah <pa...@linux.ibm.com>
> Reported-by: Ihor Pasichnyk <ihor.pasich...@ibm.com>
> Cc: Parth Shah <pa...@linux.ibm.com>
> Cc: Ihor Pasichnyk <ihor.pasich...@ibm.com>
> Cc: Juri Lelli <juri.le...@redhat.com>
> Cc: Phil Auld <pa...@redhat.com>
> Cc: Waiman Long <long...@redhat.com>
> Cc: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
> Tested-by: Juri Lelli <juri.le...@redhat.com>
> Acked-by: Waiman Long <long...@redhat.com>
> Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
> Signed-off-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
> ---
> Changelog v1 (https://patchwork.ozlabs.org/patch/1204190/) ->v3:
> Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
> This was suggested by Waiman Long.
> 
>  arch/powerpc/include/asm/spinlock.h | 5 +++--
>  arch/powerpc/mm/numa.c              | 4 ++++
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index e9a960e28f3c..de817c25deff 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -35,11 +35,12 @@
>  #define LOCK_TOKEN   1
>  #endif
>  
> -#ifdef CONFIG_PPC_PSERIES
> +#ifdef CONFIG_PPC_SPLPAR
> +DECLARE_STATIC_KEY_FALSE(shared_processor);
>  #define vcpu_is_preempted vcpu_is_preempted
>  static inline bool vcpu_is_preempted(int cpu)
>  {
> -     if (!firmware_has_feature(FW_FEATURE_SPLPAR))
> +     if (!static_branch_unlikely(&shared_processor))
>               return false;
>       return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
>  }
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 50d68d21ddcc..ffb971f3a63c 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1568,9 +1568,13 @@ int prrn_is_enabled(void)
>       return prrn_enabled;
>  }
>  
> +DEFINE_STATIC_KEY_FALSE(shared_processor);
> +EXPORT_SYMBOL_GPL(shared_processor);
> +
>  void __init shared_proc_topology_init(void)
>  {
>       if (lppaca_shared_proc(get_lppaca())) {
> +             static_branch_enable(&shared_processor);
>               bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
>                           nr_cpumask_bits);
>               numa_update_cpu_topology(false);
> -- 
> 2.18.1
>

This looks good to me, thanks Srikar.

Acked-by: Phil Auld <pa...@redhat.com>