On Thu, Jun 26, 2025 at 12:40:59AM +0530, Shrikanth Hegde wrote:
> This is a followup version if [1] with few additions. This is still an RFC
> and would like get feedback on the idea and suggestions on improvement.
>
> v1->v2:
> - Renamed to cpu_avoid_mask in place of cpu_parked_mask.

This one is not any better than the previous one. Why avoid? When avoid? As I
already said: for objects, positive, self-explanatory noun names are much
better than negative and/or function-style verb names. I suggested
cpu_paravirt_mask, and I still believe it's a much better option.

> - Used a static key such that no impact to regular case.

Static keys are not free, and they are designed for a different purpose. You
have CONFIG_PARAVIRT, and I don't understand why you're trying to avoid using
it.

I don't mind static keys if you prefer them; I just want feature-specific
code to live under the corresponding config option.

Can you please share a bloat-o-meter report for CONFIG_PARAVIRT=n? Do you have
any perf numbers to justify static keys here?

> - add sysfs file to show avoid CPUs.
> - Make RT understand avoid CPUs.
> - Add documentation patch
> - Took care of reported compile error in [1] when NR_CPUS=1
>
> -----------------
> Problem statement
> -----------------
> vCPU - Virtual CPUs - CPU in VM world.
> pCPU - Physical CPUs - CPU in baremetal world.
>
> A hypervisor is managing these vCPUs from different VMs. When a vCPU
> requests for CPU, hypervisor does the job of scheduling them on a pCPU.
>
> So this issue occurs when there are more vCPUs(combined across all VMs)
> than the pCPU. So when *all* vCPUs are requesting for CPUs, hypervisor
> can only run a few of them and remaining will be preempted(waiting for pCPU).
>
> If we take two VM's, When hypervisor preempts vCPU from VM1 to run vCPU from
> VM2, it has to do save/restore VM context.Instead if VM's can co-ordinate
> among
> each other and request for *limited* vCPUs, it avoids the above overhead and

^
Did this extra whitespace escape from the previous line, or the following?
v

> there is context switching within vCPU(less expensive). Even if hypervisor
> is preempting one vCPU to run another within the same VM, it is still more
> expensive than the task preemption within the vCPU.
> So *basic* aim to avoid
> vCPU preemption.
>
> So to achieve this, use "CPU Avoid" concept, where it is better
> if workload avoids these vCPUs at this moment.
> (vCPUs stays online, we don't want the overhead of sched domain rebuild).
>
> Contention is dynamic in nature. When there is contention for pCPU is to be
> detected and determined by architecture. Archs needs to update the mask
> accordingly.
>
> When there is contention, use limited vCPUs as indicated by arch.
> When there is no contention, use all vCPUs.
>
> -------------------------
> To be done and Questions:
> -------------------------
> 1. IRQ - still don't understand this cpu_avoid_mask. Maybe irqbalance
> code could be modified to do the same. Ran stress-ng --hrtimers, irq
> moved out of avoid cpu though. So need to see if changes to irqbalance is
> required or not.
>
> 2. If a task is spawned by affining to only avoid CPUs. Should that fail
> or throw a warning to user.

I think it's possible that an existing codebase will do that. And because you
don't want to break userspace, you should not restrict it.

> 3. Other classes such as SCHED_EXT, SCHED_DL won't understand this infra
> yet.
>
> 4. Performance testing yet to be done. RFC only verified the functional
> aspects of whether task move out of avoid CPUs or not. Move happens quite
> fast (around 1-2 seconds even on large systems with very high utilization)
>
> 5. Haven't come up an infra which could combine all push task related
> changes. It is currently spread across rt, dl, fair. Maybe some
> consolidation can be done. but which tasks to push/pull still remains in
> the class.
>
> 6. cpu_avoid_mask may need some sort of locking to ensure read/write is
> correct.
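
Coming back to question 2: below is a rough userspace model of the behaviour
I have in mind, purely as an illustration. pick_cpu(), first_cpu() and
cpu_bits_t are made-up names, not anything from this series, and a 64-bit
word stands in for struct cpumask. The idea is to prefer CPUs outside the
avoid mask, but still honour an affinity that contains only avoid CPUs
instead of failing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_bits_t;

/* Return the lowest-numbered CPU set in mask, or -1 if mask is empty. */
static int first_cpu(cpu_bits_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & ((cpu_bits_t)1 << cpu))
			return cpu;
	return -1;
}

/*
 * Pick a CPU for a task with the given affinity (allowed).
 * CPUs outside the avoid mask are preferred. If the affinity covers
 * only avoided CPUs, it is honoured anyway, so existing userspace that
 * pins tasks to such CPUs keeps working.
 */
static int pick_cpu(cpu_bits_t allowed, cpu_bits_t avoid)
{
	cpu_bits_t preferred = allowed & ~avoid;

	return first_cpu(preferred ? preferred : allowed);
}

/* Usage: CPUs 0-1 are marked as avoid in all three cases. */
static void pick_cpu_examples(void)
{
	assert(pick_cpu(0x0F, 0x03) == 2);	/* skips CPUs 0-1 */
	assert(pick_cpu(0x03, 0x03) == 0);	/* only avoid CPUs: still runs */
	assert(pick_cpu(0x00, 0x03) == -1);	/* empty affinity */
}
```

Whether the fallback case should also warn is a separate question; the
sketch only shows that it does not have to fail.
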
>
> [1]:
> https://lore.kernel.org/all/20250523181448.3777233-1-sshe...@linux.ibm.com/
>
> Shrikanth Hegde (9):
>   sched/docs: Document avoid_cpu_mask and avoid CPU concept
>   cpumask: Introduce cpu_avoid_mask
>   sched/core: Don't allow to use CPU marked as avoid
>   sched/fair: Don't use CPU marked as avoid for wakeup and load balance
>   sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task
>   sched/core: Push current task out if CPU is marked as avoid
>   sched: Add static key check for cpu_avoid
>   sysfs: Add cpu_avoid file
>   powerpc: add debug file for set/unset cpu avoid
>
>  Documentation/scheduler/sched-arch.rst | 25 +++++++++++++
>  arch/powerpc/include/asm/paravirt.h    |  2 ++
>  arch/powerpc/kernel/smp.c              | 50 ++++++++++++++++++++++++++
>  drivers/base/cpu.c                     |  8 +++++
>  include/linux/cpumask.h                | 17 +++++++++
>  kernel/cpu.c                           |  3 ++
>  kernel/sched/core.c                    | 50 +++++++++++++++++++++++++-
>  kernel/sched/fair.c                    | 11 +++++-
>  kernel/sched/rt.c                      |  9 +++--
>  kernel/sched/sched.h                   | 10 ++++++
>  10 files changed, 181 insertions(+), 4 deletions(-)
>
> --
> 2.43.0
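
To illustrate the CONFIG_PARAVIRT point from earlier in this mail, here is a
minimal sketch of the layout I mean, written as a userspace model so that it
compiles on its own. cpu_paravirt() and set_cpu_paravirt() are hypothetical
helpers, not existing kernel API, and MODEL_NR_CPUS is an arbitrary limit for
the model; in the kernel the bits would live in a struct cpumask. With the
option disabled, the helpers collapse to constant stubs, so callers need no
#ifdef of their own.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 32	/* hypothetical limit for this userspace model */

/* Build with -DCONFIG_PARAVIRT to model a kernel with the option enabled. */
#ifdef CONFIG_PARAVIRT
static unsigned long paravirt_bits;	/* one bit per CPU */

/* Report whether the given CPU is currently marked. */
static inline bool cpu_paravirt(unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	return (paravirt_bits & (1UL << cpu)) != 0;
}

/* Mark or unmark the given CPU; the arch code would call this. */
static inline void set_cpu_paravirt(unsigned int cpu, bool on)
{
	assert(cpu < MODEL_NR_CPUS);
	if (on)
		paravirt_bits |= 1UL << cpu;
	else
		paravirt_bits &= ~(1UL << cpu);
}
#else
/* Feature compiled out: no storage, and every query is constant false. */
static inline bool cpu_paravirt(unsigned int cpu)
{
	(void)cpu;
	return false;
}

static inline void set_cpu_paravirt(unsigned int cpu, bool on)
{
	(void)cpu;
	(void)on;
}
#endif
```

A caller can then write "if (cpu_paravirt(cpu))" unconditionally; with the
stubs the compiler can drop the branch, which is the kind of thing a
bloat-o-meter run would confirm or refute.
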