Abstract:
=========
The Linux task scheduler tries to find an idle CPU for a wakee task,
thereby lowering wakeup latency as much as possible. The process of
determining whether a CPU is idle has evolved over time. Currently, a
CPU is considered idle if:
- there are no tasks running on or enqueued to the runqueue of the CPU, and
- while running inside a guest, the vCPU is not yielded (determined via
  available_idle_cpu())
While inside the guest, there is no way to deterministically predict
whether a vCPU that has been yielded/ceded to the hypervisor can be
gotten back quickly. Hence the scheduler currently considers such a
CEDEd vCPU as not "available" idle and instead picks other busy CPUs
for waking up the wakee task.

In this patch-set we try to further classify idle CPUs as instantly
available or not. This is achieved by taking a hint from the
hypervisor, querying whether the vCPU will be scheduled instantly or
not. In most cases, the scheduler prefers the prev_cpu of a waking
task unless it is busy. In this patch-set, the hypervisor uses this
information to figure out if the prev_cpu used by the task (of the
corresponding vCPU) is idle or not, and passes this information to
the guest.

Interface:
==========
This patch-set introduces a new hcall named H_IDLE_HINT for the guest
to query whether a vCPU can be dispatched quickly or not. This is
currently a crude interface to demonstrate the efficacy of the method.
We are looking for feedback on other mechanisms for obtaining a hint
from the hypervisor while the hint is still relevant. This patch
series emphasizes the possible optimization of task wakeup latency and
is open to any interface/architecture.

Internal working:
=================
The code-flow of the current implementation is as follows:
- The guest OS scheduler searches for an idle CPU using
  available_idle_cpu(), which also calls vcpu_is_preempted() to see
  whether the vCPU is yielded or not.
- If the vCPU is yielded, the guest OS additionally makes the
  H_IDLE_HINT hcall to find out if the vCPU can be scheduled instantly.
- The hypervisor services the hcall by first finding the task p
  corresponding to the vCPU; it returns 1 if task_cpu(p) is
  available-idle or running a SCHED_IDLE task, and 0 otherwise.
- The guest OS takes up this hint and considers the vCPU idle if the
  hint from the hypervisor has value == 1.

The patch-set is based on the v5.11 kernel.
Results:
========
- Baseline kernel = v5.11
- Patched kernel  = v5.11 + this patch-set

Setup:
All the results are taken on an IBM POWER9 baremetal system running the
patched kernel. The system consists of 2 NUMA nodes with 22 cores per
socket in SMT-4 mode. 2 KVM guests are created sharing the same set of
physical CPUs, and each guest has an identical CPU topology of 40 CPUs:
10 cores with SMT-4 support.

Scenarios:
----------
1. Under-commit case: only one KVM guest is active at a time.

- Baseline (v5.11):
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
	50.0000th: 67
	75.0000th: 83
	90.0000th: 115
	95.0000th: 352
	*99.0000th: 2260      <-----
	99.5000th: 3580
	99.9000th: 7128
	min=0, max=9927

- With patch (v5.11 + patch):
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
	50.0000th: 100
	75.0000th: 113
	90.0000th: 328
	95.0000th: 360
	*99.0000th: 434 (-80%) <----
	99.5000th: 489
	99.9000th: 2324
	min=0, max=6054

We see a significant reduction in the tail latencies because, with the
patch-set, a task can be woken on a yielded/ceded idle CPU instead of a
busy one.

2. Over-commit case: both KVM guests share the same set of CPUs. One
   guest creates noise using `schbench -m 10 -t 2 -r 3000` while only
   the other guest is benchmarked.

- Baseline:
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
	50.0000th: 73
	75.0000th: 89
	90.0000th: 115
	95.0000th: 166
	*99.0000th: 3084
	99.5000th: 4044
	99.9000th: 7656
	min=0, max=18448

- With patch:
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
	50.0000th: 114
	75.0000th: 137
	90.0000th: 170
	95.0000th: 237
	*99.0000th: 2828
	99.5000th: 4168
	99.9000th: 7528
	min=0, max=15387

The results demonstrate that the proposed method of getting an
idle-hint from the hypervisor to better find an idle CPU in the guest
OS is very helpful in the under-commit case, due to the higher chance
of finding the previously used physical CPU idle. The results also
confirm that there is no regression in the over-commit case, where the
proposed methodology has little effect.
Additionally, more tests were carried out with different combinations
of schbench threads and different numbers of KVM guests. The results
for these tests further confirmed that there is no major regression in
workload performance.

Parth Shah (2):
  KVM:PPC: Add new hcall to provide hint if a vcpu task will be
    scheduled instantly.
  sched: Use H_IDLE_HINT hcall to find if a vCPU can be wakeup target

 arch/powerpc/include/asm/hvcall.h   |  3 ++-
 arch/powerpc/include/asm/paravirt.h | 21 +++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv.c        | 13 +++++++++++++
 arch/powerpc/kvm/trace_hv.h         |  1 +
 include/linux/kvm_host.h            |  1 +
 include/linux/sched.h               |  1 +
 kernel/sched/core.c                 | 13 +++++++++++++
 kernel/sched/fair.c                 | 12 ++++++++++++
 kernel/sched/sched.h                |  1 +
 virt/kvm/kvm_main.c                 | 17 +++++++++++++++++
 10 files changed, 80 insertions(+), 3 deletions(-)

--
2.26.2