On Wed, Jan 30, 2019 at 06:22:47AM +0100, Vincent Guittot wrote:
> The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that
> it will walk down to root the 1st time a cfs_rq is used, and we will
> finish by adding either a cfs_rq without a parent or a cfs_rq whose
> parent is already on the list. But this is not always true in the
> presence of throttling. Because a cfs_rq can be throttled even if it has
> never been used (other CPUs of the cgroup have already used all the
> bandwidth), we are not guaranteed to walk down to the root and add all
> cfs_rq to the list.
>
> Ensure that all cfs_rq will be added to the list even if they are
> throttled.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e2ff4b6..826fbe5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -352,6 +352,20 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>  	}
>  }
>  
> +static inline void list_add_branch_cfs_rq(struct sched_entity *se, struct rq *rq)
> +{
> +	struct cfs_rq *cfs_rq;
> +
> +	for_each_sched_entity(se) {
> +		cfs_rq = cfs_rq_of(se);
> +		list_add_leaf_cfs_rq(cfs_rq);
> +
> +		/* If parent is already in the list, we can stop */
> +		if (rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
> +			break;
> +	}
> +}
> +
>  /* Iterate through all leaf cfs_rq's on a runqueue: */
>  #define for_each_leaf_cfs_rq(rq, cfs_rq) \
>  	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> @@ -5179,6 +5197,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  
>  	}
>  
> +	/* Ensure that all cfs_rq have been added to the list */
> +	list_add_branch_cfs_rq(se, rq);
> +
>  	hrtick_update(rq);
>  }

So I don't much like this; at all. But maybe I misunderstand, this is
somewhat tricky stuff and I've not looked at it in a while.

So per normal we do:

	enqueue_task_fair()
		for_each_sched_entity() {
			if (se->on_rq)
				break;
			enqueue_entity()
				list_add_leaf_cfs_rq();
		}

This ensures that all parents are already enqueued, right? Because this
is what enqueues those parents.

And in this case you add an unconditional second
for_each_sched_entity(); even though it is completely redundant, afaict.

The problem seems to stem from the whole throttling crud; which (also)
breaks the above enqueue loop on throttle state, and there the parent
can go missing. So why doesn't this live in unthrottle_cfs_rq()?