On Mon, Aug 20, 2018 at 04:54:25PM -0700, Miguel de Dios wrote: > On 08/17/2018 11:27 AM, Steve Muckle wrote: > > From: John Dias <joaod...@google.com> > > > > When rt_mutex_setprio changes a task's scheduling class to RT, > > we're seeing cases where the task's vruntime is not updated > > correctly upon return to the fair class. > > Specifically, the following is being observed: > > - task is deactivated while still in the fair class > > - task is boosted to RT via rt_mutex_setprio, which changes > > the task to RT and calls check_class_changed. > > - check_class_changed leads to detach_task_cfs_rq, at which point > > the vruntime_normalized check sees that the task's state is TASK_WAKING, > > which results in skipping the subtraction of the rq's min_vruntime > > from the task's vruntime > > - later, when the prio is deboosted and the task is moved back > > to the fair class, the fair rq's min_vruntime is added to > > the task's vruntime, even though it wasn't subtracted earlier. > > The immediate result is inflation of the task's vruntime, giving > > it lower priority (starving it if there's enough available work). > > The longer-term effect is inflation of all vruntimes because the > > task's vruntime becomes the rq's min_vruntime when the higher > > priority tasks go idle. That leads to a vicious cycle, where > > the vruntime inflation repeatedly doubled. > > > > The change here is to detect when vruntime_normalized is being > > called when the task is waking but is waking in another class, > > and to conclude that this is a case where vruntime has not > > been normalized. > > > > Signed-off-by: John Dias <joaod...@google.com> > > Signed-off-by: Steve Muckle <smuc...@google.com> > > --- > > kernel/sched/fair.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index b39fb596f6c1..14011d7929d8 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct > > task_struct *p) > > * - A task which has been woken up by try_to_wake_up() and > > * waiting for actually being woken up by sched_ttwu_pending(). > > */ > > - if (!se->sum_exec_runtime || p->state == TASK_WAKING) > > + if (!se->sum_exec_runtime || > > + (p->state == TASK_WAKING && p->sched_class == &fair_sched_class)) > > return true; > > return false;
> The normalization of vruntime used to exist in task_waking but it was > removed and the normalization was moved into migrate_task_rq_fair. The > reasoning being that task_waking_fair was only hit when a task is queued > onto a different core and migrate_task_rq_fair should do the same work. > > However, we're finding that there's one case which migrate_task_rq_fair > doesn't hit: that being the case where rt_mutex_setprio changes a task's > scheduling class to RT when its scheduled out. The task never hits > migrate_task_rq_fair because it is switched to RT and migrates as an RT > task. Because of this we're getting an unbounded addition of min_vruntime > when the task is re-attached to the CFS runqueue when it loses the inherited > priority. The patch above works because now the kernel specifically checks > for this case and normalizes accordingly. > > Here's the patch I was talking about: > https://lore.kernel.org/patchwork/patch/677689/. In our testing we were > seeing vruntimes nearly double every time after rt_mutex_setprio boosts the > task to RT. Bah, patchwork is such shit... how do you get to the previus patch from there? Because I think 2/3 is the actual commit that changed things, 3/3 just cleans up a bit. That would be commit: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration") But I'm still somewhat confused; how would task_waking_fair() have helped if we're already changed to a different class?