At present, the scheduler scales CPU capacity for fair tasks based on
the time spent handling irqs and on steal time. If a CPU sees irq or
steal time, its capacity for fair tasks decreases, causing tasks to
migrate to other CPUs that are not affected by irq or steal time. All
of this is gated by the scheduler feature NONTASK_CAPACITY.
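
For intuition, here is a minimal standalone sketch (a simplified model
in the spirit of scale_irq_capacity(), not the actual kernel code; the
helper name and numbers are made up for illustration) of how time
charged to irq/steal shrinks the capacity left for fair tasks:

/*
 * Illustrative only: reduce 'free' capacity by the fraction of 'max'
 * consumed by irq + steal time.
 */
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL

static unsigned long scale_irq_capacity_sketch(unsigned long free,
					       unsigned long irq,
					       unsigned long max)
{
	free *= (max - irq);
	free /= max;
	return free;
}

int main(void)
{
	unsigned long max = SCHED_CAPACITY_SCALE;	/* nominal CPU capacity */
	unsigned long irq_plus_steal = 256;		/* ~25% of the CPU */

	printf("capacity left for fair tasks: %lu / %lu\n",
	       scale_irq_capacity_sketch(max, irq_plus_steal, max), max);
	return 0;
}

With irq + steal at 256 of 1024, roughly a quarter of the CPU is
reported as unavailable to fair tasks, which is what makes load
balancing prefer other CPUs.
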
In virtualized setups, a CPU that reports steal time (time taken by the
hypervisor) can cause tasks to migrate unnecessarily to sibling CPUs
that appear to be less busy, only for the situation to reverse shortly.
To mitigate this ping-pong behaviour, this change introduces a new
static branch, sched_acct_steal_cap, which controls whether steal time
contributes to non-task capacity adjustments (used for fair scheduling).

Signed-off-by: Srikar Dronamraju <[email protected]>
---
Changelog v1->v2:
v1: https://lkml.kernel.org/r/[email protected]
Peter suggested using a static branch instead of a sched feature

 include/linux/sched/topology.h |  6 ++++++
 kernel/sched/core.c            | 15 +++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 198bb5cc1774..88e34c60cffd 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -285,4 +285,10 @@ static inline int task_node(const struct task_struct *p)
 	return cpu_to_node(task_cpu(p));
 }
 
+#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
+extern void sched_disable_steal_acct(void);
+#else
+static __always_inline void sched_disable_steal_acct(void) { }
+#endif
+
 #endif /* _LINUX_SCHED_TOPOLOGY_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81c6df746df1..09884da6b085 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -738,6 +738,14 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 /*
  * RQ-clock updating methods:
  */
+#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
+static DEFINE_STATIC_KEY_TRUE(sched_acct_steal_cap);
+
+void sched_disable_steal_acct(void)
+{
+	return static_branch_disable(&sched_acct_steal_cap);
+}
+#endif
 
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
@@ -792,8 +800,11 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 	rq->clock_task += delta;
 
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
-	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
-		update_irq_load_avg(rq, irq_delta + steal);
+	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) {
+		if (steal && static_branch_likely(&sched_acct_steal_cap))
+			irq_delta += steal;
+		update_irq_load_avg(rq, irq_delta);
+	}
 #endif
 	update_rq_clock_pelt(rq, delta);
 }
--
2.47.3
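
The patch adds the opt-out hook but no caller yet; a hypothetical
consumer (the function and flag names below are placeholders chosen
for illustration, not taken from this series) might be a paravirt
guest's steal-clock setup path:

/*
 * Hypothetical caller, not part of this series: disable steal-time
 * capacity accounting once the guest knows a steal clock is in use.
 */
static int __init guest_disable_steal_capacity(void)
{
	if (has_steal_clock)			/* placeholder condition */
		sched_disable_steal_acct();	/* API added by this patch */
	return 0;
}
late_initcall(guest_disable_steal_capacity);

Disabling the branch flips sched_acct_steal_cap off system-wide, so
subsequent update_rq_clock_task() calls fold only irq time into
update_irq_load_avg().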
