CPU Hotplug smt=off operation on a maximum configuration ppc64le system with 1920 logical CPUs takes more than 59 minutes to complete.

Several attempts to reduce the time taken by the CPU hotplug operation are discussed in the thread below:

https://lore.kernel.org/all/y01uwql2y2r69...@li-05afa54c-330e-11b2-a85c-e3f3aa0db1e9.ibm.com/

Applying the solution discussed in that thread brings the time taken for the CPU hotplug smt=off operation down from 59m to 32m, a performance improvement of around 45%.

Although this is a significant improvement, 32m is still a long time for a CPU hotplug (smt=off) operation. To bring it down further, we analysed the blocking time overhead in CPU hotplug using the offcputime bcc script. The script outputs the stack traces of the tasks that were blocked and the total duration for which they were blocked, which helps identify areas for improvement.

offcputime bcc script: https://github.com/iovisor/bcc/blob/master/tools/offcputime.py

Below is one of the call stacks that accounted for most of the blocking time overhead reported by the offcputime bcc script for the CPU offline operation:

    finish_task_switch
    __schedule
    schedule
    schedule_timeout
    wait_for_completion
    __wait_rcu_gp
    synchronize_rcu
    cpuidle_uninstall_idle_handler
    powernv_cpuidle_cpu_dead
    cpuhp_invoke_callback
    __cpuhp_invoke_callback_range
    _cpu_down
    cpu_device_down
    cpu_subsys_offline
    device_offline
    online_store
    dev_attr_store
    sysfs_kf_write
    kernfs_fop_write_iter
    vfs_write
    ksys_write
    system_call_exception
    system_call_common
    -                bash (29705)
        5771569  ------------------------> Duration (us)

From the above call stack, it is observed that synchronize_rcu in cpuidle_uninstall_idle_handler accounts for a major chunk of the overhead seen in CPU online and offline operations. This stack trace is observed on pseries and powernv systems, but not on ACPI based systems, where we don't invoke cpuidle_disable_device during the CPU hotplug offline operation.
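
The lower frames of the stack above (ksys_write down to online_store) are the path taken when a task writes to a CPU's sysfs online file. As a rough illustration of how such an operation can be driven and timed from user space, here is a minimal sketch that assumes the standard /sys/devices/system/cpu/cpuN/online interface. The function name set_cpus_online and its sysfs_root parameter are hypothetical and exist only for this example. This is not the script used to collect the numbers in this mail.

```python
import os
import time


def set_cpus_online(cpus, online, sysfs_root="/sys/devices/system/cpu"):
    """Write "1" or "0" to each cpuN/online file; return elapsed seconds.

    Hypothetical helper for illustration only. `sysfs_root` is a parameter
    so the function can be pointed at a scratch directory instead of the
    real sysfs tree.
    """
    value = "1" if online else "0"
    start = time.monotonic()
    for cpu in cpus:
        path = os.path.join(sysfs_root, f"cpu{cpu}", "online")
        # On a real system each write blocks until the kernel has finished
        # onlining or offlining that CPU, so the elapsed time includes any
        # time spent waiting inside the hotplug callbacks.
        with open(path, "w") as f:
            f.write(value)
    return time.monotonic() - start
```

On a real system this needs root. For example, set_cpus_online(range(1, 128), online=False) would attempt to offline CPUs 1 to 127 one after another and return the total wall-clock time in seconds.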

To check for this overhead, the patch that introduced synchronize_rcu in cpuidle_uninstall_idle_handler, 442bf3aaf55a ("sched: Let the scheduler see CPU idle states"), was reverted.

The measurements were taken on a machine with 128 logical CPUs and the configuration below:

    root@ltc:~# lscpu
    Architecture:          ppc64le
    Byte Order:            Little Endian
    CPU(s):                128
    On-line CPU(s) list:   0-127
    Thread(s) per core:    4
    Core(s) per socket:    16
    Socket(s):             2
    NUMA node(s):          8
    Model:                 2.3 (pvr 004e 1203)
    Model name:            POWER9, altivec supported
    CPU max MHz:           3800.0000
    CPU min MHz:           2300.0000
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              512K
    L3 cache:              10240K
    NUMA node0 CPU(s):     0-63
    NUMA node8 CPU(s):     64-127
    NUMA node250 CPU(s):
    NUMA node251 CPU(s):
    NUMA node252 CPU(s):
    NUMA node253 CPU(s):
    NUMA node254 CPU(s):
    NUMA node255 CPU(s):

The tables below list the total time taken for the CPU hotplug offline and online operations in 4 different scenarios:

|-------------------------------------------------------------------------|
| Time taken to offline 127 CPUs (niters : 10)                            |
|--------------------------------------------------|---------|------------|
| kernel version                                   | avg (s) | % decrease |
|--------------------------------------------------|---------|------------|
| (1) v6.2.0-rc5                                   | 17.945  | baseline   |
| (2) revert 442bf3aaf55a (remove synchronize_rcu) | 10.259  | 42.831     |
| (3) replace synchronize_rcu with                 |         |            |
|     synchronize_rcu_expedited                    | 10.129  | 43.554     |
|     in cpuidle_uninstall_idle_handler            |         |            |
| (4) enable system-wide rcu_expedited             | 0.842   | 95.304     |
|--------------------------------------------------|---------|------------|

|-------------------------------------------------------------------------|
| Time taken to online 127 CPUs (niters : 10)                             |
|--------------------------------------------------|---------|------------|
| kernel version                                   | avg (s) | % decrease |
|--------------------------------------------------|---------|------------|
| (1) v6.2.0-rc5                                   | 16.474  | baseline   |
| (2) revert 442bf3aaf55a (remove synchronize_rcu) | 12.503  | 24.104     |
| (3) replace synchronize_rcu with                 |         |            |
|     synchronize_rcu_expedited                    | 12.817  | 22.197     |
|     in cpuidle_uninstall_idle_handler            |         |            |
| (4) enable system-wide rcu_expedited             | 0.4983  | 96.975     |
|--------------------------------------------------|---------|------------|

Note: A performance improvement of around 16% for the CPU offline operation is also observed on large configuration systems with nCPUs = 1600 by avoiding `synchronize_rcu` in `cpuidle_uninstall_idle_handler`.

The tables above show that the synchronize_rcu introduced in 442bf3aaf55a ("sched: Let the scheduler see CPU idle states") accounts for around 43% and 24% of the total time taken by the CPU hotplug offline and online operations respectively. Any guidance or suggestions from the community on how to optimize this would be really helpful.

Thanks,
Aboorva