On Fri, 26 Jun, 2026, 4:26 pm Vishal Chourasia, <[email protected]>
wrote:
> This series fixes a KVM scheduling bug on Book3S HV (POWER8/POWER9/POWER10)
> where a guest VM under a cpu.max bandwidth limit can run arbitrarily past
> its quota and then appear completely frozen for minutes afterwards.
>
> == Background ==
>
> Commit 2cd571245b43 ("sched/fair: Add related data structure for task based
> throttle"), merged in v6.18, changed how CFS bandwidth throttling enforces
> its limit. Previously, throttle_cfs_rq() dequeued tasks directly. Under the
> new scheme it queues a task_work item via task_work_add(..., TWA_RESUME),
> sets TIF_NOTIFY_RESUME, and relies on that work running on the kernel
> return path to actually dequeue the task.
>
> For KVM guests this means the work must be drained before each guest entry,
> not just on the normal syscall return path. Commit 935ace2fb5cc ("entry:
> Provide infrastructure for work before transitioning to guest mode")
> introduced kvm_xfer_to_guest_mode_handle_work() for exactly this purpose.
> x86 (commit 72c3c0fe54a3), arm64 (commit 6caa5812e2d1), riscv, s390, and
> loongarch all adopted it. Book3S HV did not. [1]
>
IMHO, the cover letter carries significant information (the Background and
Root Cause sections) that deserves to be part of the patch's commit log.
Thanks
Harsh
>
> == Root Cause ==
>
> Book3S HV's vCPU run loops — kvmhv_run_single_vcpu() for POWER9+ and
> kvmppc_run_vcpu() for pre-POWER9 — only test TIF_SIGPENDING and
> TIF_NEED_RESCHED before re-entering the guest. TIF_NOTIFY_RESUME is never
> checked, and the deferred throttle task_work therefore never runs while a
> vCPU is inside the run loop.
>
> For a CPU-bound guest that generates few KVM exits back to QEMU user space
> (e.g. a compute-heavy or busy-looping workload), the vCPU thread never
> returns to user mode. throttle_cfs_rq() sets cfs_rq->throttled = 1 and
> queues the task_work, but the guest continues to run unchecked.
> cfs_rq->runtime_remaining goes increasingly negative with every scheduling
> period while the throttle flag sits ignored.
>
> The only mechanism recovering that debt is the periodic bandwidth timer
> replenishment: 30 ms of quota is added per 100 ms period. When
> runtime_remaining has drifted hundreds of seconds negative, recovering to
> zero at 300 ms/s takes minutes. Once the vCPU finally exits to user space,
> the cgroup stays legitimately throttled for that whole time and the VM
> appears completely frozen.
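
To make the arithmetic above concrete, here is a toy Python model of the
per-period accounting. It is only a sketch under simplifying assumptions: a
single CPU-bound vCPU that consumes the whole period when the throttle is
not enforced, and the 30 ms / 100 ms limits from this report. The function
name and its parameters are made up for illustration and do not correspond
to any kernel code.

```python
def runtime_remaining_ms(periods, quota_ms=30, period_ms=100, enforced=False):
    """Toy model: budget left (in ms) after `periods` bandwidth periods.

    Assumes one CPU-bound vCPU. If the throttle is enforced, the vCPU stops
    once the quota is used up; otherwise it runs for the whole period.
    """
    remaining = 0
    for _ in range(periods):
        remaining += quota_ms                  # timer replenishes the quota
        wanted = period_ms                     # CPU-bound: wants every ms
        if enforced:
            used = min(wanted, max(remaining, 0))
        else:
            used = wanted                      # throttle flag is ignored
        remaining -= used
    return remaining


# 10 periods of 100 ms correspond to 1 s of wall-clock time.
print(runtime_remaining_ms(10, enforced=False))  # debt after 1 s, unenforced
print(runtime_remaining_ms(10, enforced=True))   # budget after 1 s, enforced
```

In this model the unenforced case loses 70 ms per period, which is the
700 ms/s drift described above, while the enforced case never goes negative.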
>
> == Debugging ==
>
> The vCPU was placed in a cgroup with the following CPU bandwidth limits:
>   quota  = 30 ms
>   period = 100 ms
>
> The bug was diagnosed using a bpftrace script probing throttle_cfs_rq()
> and unthrottle_cfs_rq() and sampling cfs_rq->runtime_remaining every
> second. The trace shows the debt accumulation phase, the slow recovery
> phase, and the immediate re-throttle on resumption:
>
> Debt accumulation (vCPU in guest, no exits):
> +1471 s runtime_remaining=-209702865115 ns throttled=1
> +1472 s runtime_remaining=-210402866357 ns throttled=1
> ... # ~-700 ms/s (growing debt)
> +1477 s runtime_remaining=-213902833931 ns throttled=1
>
> Recovery (vCPU exits to QEMU user space; bandwidth timer replenishes):
> +1478 s runtime_remaining=-213617443453 ns throttled=1
> +1479 s runtime_remaining=-213317443453 ns throttled=1
> ... # ~+300 ms/s (30ms quota/100ms)
>
> After ~710 seconds of recovery, debt reaches zero:
> ──── unthrottle_cfs_rq @ cpu=768 +2190.029568131 s ────
> runtime_remaining = 1 ns # just crossed zero
>
> The vCPU immediately re-enters the guest and over-runs its quota again:
> ──── throttle_cfs_rq @ cpu=768 +2190.055327252 s ────
> runtime_remaining = -5667293 ns # ~5.7 ms of debt after only 26 ms
>
> The cycle then repeats identically from a fresh -700 ms/s accumulation.
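
The rates quoted in the trace can be cross-checked with a few lines of
Python. This is plain arithmetic on the sample values shown above; the
variable names are mine and nothing here is measured independently.

```python
# Samples copied from the trace above: {seconds: runtime_remaining in ns}.
accum = {1471: -209702865115, 1472: -210402866357}
recov = {1478: -213617443453, 1479: -213317443453}

# Change in runtime_remaining over one second, expressed in ms.
accum_rate_ms = (accum[1472] - accum[1471]) / 1e6
recov_rate_ms = (recov[1479] - recov[1478]) / 1e6

# Time needed to repay the debt seen at +1478 s at the observed recovery rate.
recovery_s = -recov[1478] / (recov[1479] - recov[1478])

print(f"accumulation: {accum_rate_ms:.1f} ms/s")
print(f"recovery:     {recov_rate_ms:+.1f} ms/s")
print(f"estimated time to zero: {recovery_s:.0f} s")
```

The estimate comes out at roughly 712 s, which lines up with the unthrottle
at +2190 s, about 712 s after the +1478 s sample.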
>
> cpu.stat confirms the pathology — 100% throttle rate and virtually all
> CPU time accumulated in kernel (KVM) mode:
>
> nr_periods = 117457
> nr_throttled = 117457 # every single period
> system_usec = 4334782636 # >99.99% kernel time (QEMU in KVM_RUN)
>
> strace of the QEMU vCPU thread confirms long stretches where
> ioctl(KVM_RUN) does not return — the vCPU is running in guest mode
> with no VM-exits reaching user space.
>
> == Fix Summary ==
>
> Opt Book3S HV into VIRT_XFER_TO_GUEST_WORK and drain pending guest-mode
> work (including the deferred CFS throttle task_work) on every guest
> re-entry in both run loops. The new checks are a superset of the existing
> need_resched() checks and do not alter the signal or exit accounting.
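
As a rough illustration of the idea (not the actual patch, which is C code
in arch/powerpc/kvm/book3s_hv.c), the Python toy below models a run loop
that either ignores or honours a pending-work flag before each guest entry.
All names are hypothetical stand-ins: `notify_resume` plays the role of
TIF_NOTIFY_RESUME, and `drain_work=True` stands for calling the generic
xfer-to-guest work handler. The `break` is a simplification that represents
the vCPU thread being dequeued by the throttle work.

```python
from dataclasses import dataclass


@dataclass
class ToyVcpu:
    notify_resume: bool = False   # stand-in for TIF_NOTIFY_RESUME
    dequeued: bool = False        # set once the throttle work has run
    guest_entries: int = 0


def run_loop(vcpu, max_entries, throttle_at, drain_work):
    """Enter the 'guest' up to max_entries times.

    The scheduler is modelled as queueing throttle work just before entry
    number `throttle_at`. With drain_work=True the loop checks the flag
    before every entry, runs the work and stops; otherwise it is ignored.
    """
    for entry in range(max_entries):
        if entry == throttle_at:
            vcpu.notify_resume = True      # throttle work gets queued
        if drain_work and vcpu.notify_resume:
            vcpu.notify_resume = False
            vcpu.dequeued = True           # work ran: vCPU is throttled
            break
        vcpu.guest_entries += 1            # guest runs for another slice
    return vcpu


print(run_loop(ToyVcpu(), 1000, 3, drain_work=False))
print(run_loop(ToyVcpu(), 1000, 3, drain_work=True))
```

Without the check the toy vCPU keeps entering the guest with the flag still
set; with the check it stops at the first entry after the work is queued.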
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> Vishal Chourasia (1):
> KVM: powerpc/book3s_hv: Use generic xfer to guest work function
>
> arch/powerpc/kvm/Kconfig | 1 +
> arch/powerpc/kvm/book3s_hv.c | 58 +++++++++++++++++++++++++++++++-----
> 2 files changed, 52 insertions(+), 7 deletions(-)
>
> --
> 2.54.0
>