CPU Hotplug smt=off operation on a maximum configuration ppc64le system
with 1920 logical CPUs takes more than 59 minutes to complete.

Several attempts to reduce the time taken by the CPU hotplug operation
are discussed in the thread below:
https://lore.kernel.org/all/y01uwql2y2r69...@li-05afa54c-330e-11b2-a85c-e3f3aa0db1e9.ibm.com/

By applying the solution discussed in the above thread, the time taken
for the CPU hotplug smt=off operation is brought down from 59m to 32m,
a reduction of around 45%.

Although this is a significant improvement, 32m is still a long time for
a CPU hotplug (smt=off) operation. To bring it down further, we analysed
the blocking-time overhead in CPU hotplug using the offcputime bcc script.
The script reports the stack-traces of the tasks that were blocked and the
total duration for which they were blocked, which helps identify areas
for improvement.

offcputime bcc script:
https://github.com/iovisor/bcc/blob/master/tools/offcputime.py
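
As a rough illustration of how such output can be post-processed, the
sketch below ranks stacks by total blocked time. It assumes the script was
run with folded output (the -f option), where each line is a
semicolon-separated stack followed by the total in microseconds. The
function name rank_folded_stacks is hypothetical and not part of bcc.

```python
from collections import Counter


def rank_folded_stacks(lines, top=3):
    """Return the `top` (stack, microseconds) pairs with the most blocked time.

    Assumed input format (offcputime -f): "frame;frame;...;frame <total_us>"
    """
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split off the trailing number; everything before it is the stack.
        stack, _, value = line.rpartition(" ")
        if not stack or not value.isdigit():
            continue  # skip lines that do not match the assumed format
        totals[stack] += int(value)
    return totals.most_common(top)
```

For example, rank_folded_stacks(open("offline.folded")) would return the
three stacks with the largest blocked time, where offline.folded is a
placeholder name for a file holding the captured output.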

Below is one of the call-stacks that accounted for most of the blocking
time overhead, as reported by the offcputime bcc script for the CPU
offline operation:

    finish_task_switch
    __schedule
    schedule
    schedule_timeout
    wait_for_completion
    __wait_rcu_gp
    synchronize_rcu
    cpuidle_uninstall_idle_handler
    powernv_cpuidle_cpu_dead
    cpuhp_invoke_callback
    __cpuhp_invoke_callback_range
    _cpu_down
    cpu_device_down
    cpu_subsys_offline
    device_offline
    online_store
    dev_attr_store
    sysfs_kf_write
    kernfs_fop_write_iter
    vfs_write
    ksys_write
    system_call_exception
    system_call_common
   -                bash (29705)
        5771569  ------------------------>  Duration (us)

From the above call-stack, it is observed that synchronize_rcu in
cpuidle_uninstall_idle_handler accounts for a major chunk of the overhead
seen in CPU online and offline operations. This stack-trace is observed
on pseries and powernv systems but not on ACPI-based systems, where we
don't invoke cpuidle_disable_device during the CPU hotplug offline
operation.
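
The trace starts at a write to the sysfs online attribute of a CPU device
(online_store), so the cost of a hotplug transition can be timed from user
space. The following is a minimal sketch of such a timing loop, not the
script used for the numbers in this mail. The helper names and the niters
parameter are illustrative, and it assumes root privileges and the
standard /sys/devices/system/cpu layout.

```python
import time
from pathlib import Path

# Standard sysfs location of the CPU devices; overridable for dry runs.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def set_cpu_online(cpu, online, base=SYSFS_CPU):
    """Write 1/0 to cpu<N>/online and return the elapsed time in seconds.

    The write returns only after the hotplug transition has finished,
    so the elapsed time covers the whole operation for that CPU.
    """
    path = Path(base) / f"cpu{cpu}" / "online"
    start = time.perf_counter()
    path.write_text("1" if online else "0")
    return time.perf_counter() - start


def time_hotplug(cpus, online, base=SYSFS_CPU):
    """Total time to move every CPU in `cpus` to the requested state."""
    return sum(set_cpu_online(cpu, online, base) for cpu in cpus)


def average_hotplug_times(cpus, niters=10, base=SYSFS_CPU):
    """Return (avg offline seconds, avg online seconds) over `niters` runs."""
    cpus = list(cpus)
    offline, online = [], []
    for _ in range(niters):
        offline.append(time_hotplug(cpus, False, base))
        online.append(time_hotplug(cpus, True, base))
    return sum(offline) / niters, sum(online) / niters
```

On a 128-CPU machine, average_hotplug_times(range(1, 128)) would offline
and online CPUs 1-127 ten times, leaving CPU 0 online throughout.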

The patch that introduced synchronize_rcu in cpuidle_uninstall_idle_handler,
442bf3aaf55a ("sched: Let the scheduler see CPU idle states"),
was reverted to measure the overhead it accounts for.

The test machine has 128 logical CPUs and the following configuration:

root@ltc:~# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        8
Model:               2.3 (pvr 004e 1203)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2300.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
NUMA node250 CPU(s):
NUMA node251 CPU(s):
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):

The tables below list the total time taken for the CPU hotplug offline
and online operations in 4 different scenarios:

|-------------------------------------------------------------------------|
| Time taken to offline 127 CPUs (niters : 10)                            |
|--------------------------------------------------|---------|------------|
| kernel version                                   | avg (s) | % decrease |
|--------------------------------------------------|---------|------------|
| (1) v6.2.0-rc5                                   | 17.945  | baseline   |
| (2) revert 442bf3aaf55a (remove synchronize_rcu) | 10.259  | 42.831     |
| (3) replace synchronize_rcu with                 |         |            |
|     synchronize_rcu_expedited                    | 10.129  | 43.554     |
|     in cpuidle_uninstall_idle_handler            |         |            |
| (4) enable system-wide rcu_expedited             | 0.842   | 95.304     |
|--------------------------------------------------|---------|------------|

|-------------------------------------------------------------------------|
| Time taken to online 127 CPUs (niters : 10)                             |
|--------------------------------------------------|---------|------------|
| kernel version                                   | avg (s) | % decrease |
|--------------------------------------------------|---------|------------|
| (1) v6.2.0-rc5                                   | 16.474  | baseline   |
| (2) revert 442bf3aaf55a (remove synchronize_rcu) | 12.503  | 24.104     |
| (3) replace synchronize_rcu with                 |         |            |
|     synchronize_rcu_expedited                    | 12.817  | 22.197     |
|     in cpuidle_uninstall_idle_handler            |         |            |
| (4) enable system-wide rcu_expedited             | 0.4983  | 96.975     |
|--------------------------------------------------|---------|------------|
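
The "% decrease" column appears to be computed relative to the baseline
average as (baseline - avg) / baseline * 100. The sketch below recomputes
it from the averages in the tables. percent_decrease is a hypothetical
helper name, and the last digit may differ slightly from the tables
because the averages shown are already rounded.

```python
def percent_decrease(baseline, value):
    """Reduction of `value` relative to `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


# Average times in seconds, copied from the tables above.
offline = {
    "baseline": 17.945,
    "revert 442bf3aaf55a": 10.259,
    "synchronize_rcu_expedited": 10.129,
    "system-wide rcu_expedited": 0.842,
}
online = {
    "baseline": 16.474,
    "revert 442bf3aaf55a": 12.503,
    "synchronize_rcu_expedited": 12.817,
    "system-wide rcu_expedited": 0.4983,
}

for name, table in (("offline", offline), ("online", online)):
    base = table["baseline"]
    for scenario, avg in table.items():
        if scenario == "baseline":
            continue
        print(f"{name:8s} {scenario:28s} {percent_decrease(base, avg):7.3f}")
```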

Note: By avoiding synchronize_rcu in cpuidle_uninstall_idle_handler, a
performance improvement of around 16% for the CPU offline operation is
also observed on large configuration systems with nCPUs = 1600.

It is observed from the above tables that the synchronize_rcu introduced
in 442bf3aaf55a ("sched: Let the scheduler see CPU idle states") accounts
for around 43% and 24% of the total time taken by the CPU hotplug offline
and online operations respectively. Any guidance from the community on
how to optimize this would be really helpful.

Thanks,
Aboorva
