Hi Peter,

On Tuesday 10 Oct 2017 at 09:44:53 (+0200), Vincent Guittot wrote:
> On 10 October 2017 at 09:29, Peter Zijlstra <pet...@infradead.org> wrote:
> > On Mon, Oct 09, 2017 at 05:29:04PM +0200, Vincent Guittot wrote:
> >> On 9 October 2017 at 17:03, Vincent Guittot <vincent.guit...@linaro.org> wrote:
> >> > On 1 September 2017 at 15:21, Peter Zijlstra <pet...@infradead.org> wrote:
> >
> >> >> +/*
> >> >> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> >> >> + * propagate its contribution. The key to this propagation is the invariant
> >> >> + * that for each group:
> >> >> + *
> >> >> + *   ge->avg == grq->avg					(1)
> >> >> + *
> >> >> + * _IFF_ we look at the pure running and runnable sums. Because they
> >> >> + * represent the very same entity, just at different points in the hierarchy.
> >> >
> >> > I agree for the running part because only one entity can be running,
> >> > but I'm not sure for the pure runnable sum because we can have
> >> > several runnable tasks in a cfs_rq but only one runnable group entity
> >> > to reflect them, or I misunderstand (1).
> >
> > The idea is that they (ge and grq) are the _same_ entity, just at
> > different levels in the hierarchy. If the grq is runnable, it is through
> > the ge.
> >
> > As a whole, they don't care how many runnable tasks there are.
>
> Ok, so I agree with this point that both grq and ge runnable follow the
> same trend.
>
> My point is that even if the same changes apply to both grq and ge, we
> can't directly apply changes of grq's runnable to ge's runnable.
>
> >
> >> > As an example, we have 2 always running tasks TA and TB, so their
> >> > load_sum is LOAD_AVG_MAX for each task. The grq->avg.load_sum = \Sum
> >> > se->avg.load_sum = 2*LOAD_AVG_MAX, but the ge->avg.load_sum will be
> >> > only LOAD_AVG_MAX.
> >> >
> >> > So if we apply directly the d(TB->avg.load_sum) on the group hierarchy,
> >> > and on ge->avg.load_sum in particular, the latter decreases to 0
> >> > whereas it should decrease only by half.
> >> >
> >> > I have been able to see this wrong behavior with an rt-app json file.
> >> >
> >> > So I think that we should instead remove only
> >> >
> >> > delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum
> >>
> >> delta = se->avg.load_sum / (grq->avg.load_sum + se->avg.load_sum) *
> >> ge->avg.load_sum
> >>
> >> as the se has already been detached.
> >>
> >> > We don't have grq->avg.load_sum but we can have a rough estimate with
> >> > grq->avg.load_avg / grq->weight.
> >
> > Hurm, I think I see what you're saying, let me ponder this more.
>
> The formula above was an example for detaching but it doesn't apply to
> all use cases. We need a more generic way to propagate runnable changes.
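To make the TA/TB behaviour discussed above concrete, here is a minimal userspace sketch (a toy, not kernel code: plain scalars, no fixed-point scaling, and LOAD_AVG_MAX hard-coded to the PELT constant) of the naive propagation versus the proportional delta quoted above:

/*
 * Toy illustration: two always-running tasks TA and TB each contribute
 * LOAD_AVG_MAX to grq->avg.load_sum, while ge->avg.load_sum saturates at
 * LOAD_AVG_MAX. Removing TB's whole load_sum from ge wrongly zeroes it;
 * the proportional delta removes only half.
 */
#include <stdio.h>

#define LOAD_AVG_MAX	47742	/* PELT maximum accumulated sum */

int main(void)
{
	long tb_load_sum = LOAD_AVG_MAX;	/* TB, always running */
	long grq_load_sum = 2 * LOAD_AVG_MAX;	/* TA + TB */
	long ge_load_sum = LOAD_AVG_MAX;	/* group entity saturates */

	/* Naive propagation: subtract TB's whole load_sum from ge */
	long naive = ge_load_sum - tb_load_sum;	/* 0: too aggressive */

	/*
	 * Proportional removal, with TB already detached from the grq:
	 * delta = se->load_sum / (grq->load_sum + se->load_sum) * ge->load_sum
	 */
	grq_load_sum -= tb_load_sum;
	long delta = (long)((double)tb_load_sum /
			    (grq_load_sum + tb_load_sum) * ge_load_sum);

	printf("naive ge: %ld, proportional ge: %ld\n",
	       naive, ge_load_sum - delta);	/* 0 vs ~LOAD_AVG_MAX/2 */
	return 0;
}

With TB already detached from the grq, the proportional formula removes only half of ge->avg.load_sum instead of zeroing it.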
I have studied a bit more how to improve the propagation formula and the
changes below do the job for the use cases that I have tested.

Unlike running, we can't directly propagate the runnable sum through the
hierarchy when we migrate a task. Instead we must ensure that we don't
over/underestimate the impact of the migration, thanks to several rules:

- ge->avg.runnable_sum can't be higher than LOAD_AVG_MAX

- ge->avg.runnable_sum can't be lower than ge->avg.running_sum (once
  scaled to the same range)

- we can't directly propagate a negative delta of runnable_sum because
  part of this runnable time can be "shared" with other sched_entities
  and stays on the gcfs_rq. Instead, we can estimate the new
  runnable_sum of the gcfs_rq with the formula:
  gcfs_rq's runnable_sum = gcfs_rq's load_sum / gcfs_rq's weight

- ge->avg.runnable_sum can't increase when we detach a task.
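Before the actual patch, a rough standalone sketch of how these rules combine (a userspace toy with plain scalars; propagate_runnable(), min_l() and max_l() are made up for the illustration, and the fixed-point scaling, se_weight()/se_runnable() handling and util_sum shift of the real code are left out):

#include <stdio.h>

#define LOAD_AVG_MAX	47742	/* PELT maximum accumulated sum */

static long min_l(long a, long b) { return a < b ? a : b; }
static long max_l(long a, long b) { return a > b ? a : b; }

/*
 * prop:          runnable_sum delta to propagate (negative when a task leaves)
 * ge_runnable:   current runnable_sum of the group entity
 * gcfs_load_sum, gcfs_weight: used for the rough load_sum / weight estimate
 * running_sum:   group entity's running_sum, already scaled to the same range
 */
static long propagate_runnable(long prop, long ge_runnable,
			       long gcfs_load_sum, long gcfs_weight,
			       long running_sum)
{
	long estimate = gcfs_weight ? gcfs_load_sum / gcfs_weight : 0;
	long runnable_sum;

	if (prop >= 0) {
		/* attach: add the delta but never exceed LOAD_AVG_MAX */
		runnable_sum = min_l(ge_runnable + prop, LOAD_AVG_MAX);
	} else {
		/* detach: fall back to the gcfs_rq estimate, never increase */
		runnable_sum = min_l(ge_runnable, estimate);
	}

	/* runnable can never be lower than running */
	return max_l(runnable_sum, running_sum);
}

int main(void)
{
	/*
	 * Detach example: ge was fully runnable, the tasks remaining on the
	 * gcfs_rq are runnable half the time, ge runs a quarter of the time.
	 */
	long res = propagate_runnable(-LOAD_AVG_MAX, LOAD_AVG_MAX,
				      1024L * (LOAD_AVG_MAX / 2), 1024,
				      LOAD_AVG_MAX / 4);

	printf("new runnable_sum: %ld (LOAD_AVG_MAX/2 = %d)\n",
	       res, LOAD_AVG_MAX / 2);
	return 0;
}

This only shows the shape of the decision logic; the patch below additionally recomputes the weighted load and runnable_load sums and applies the resulting deltas to the parent cfs_rq.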
---
 kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 45 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350dbec0..a063b048 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3489,33 +3489,67 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	long runnable_sum = gcfs_rq->prop_runnable_sum;
-	long runnable_load_avg, load_avg;
-	s64 runnable_load_sum, load_sum;
+	long running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+	long runnable_load_avg, delta_avg, load_avg;
+	s64 runnable_load_sum, delta_sum, load_sum = 0;
 
 	if (!runnable_sum)
 		return;
 
 	gcfs_rq->prop_runnable_sum = 0;
 
+	/*
+	 * Get a rough estimate of gcfs_rq's runnable.
+	 * This is a low guess as it assumes that tasks are equally
+	 * runnable, which is not true, but we can't really do better.
+	 */
+	if (scale_load_down(gcfs_rq->load.weight))
+		load_sum = div_s64(gcfs_rq->avg.load_sum,
+				scale_load_down(gcfs_rq->load.weight));
+
+	/*
+	 * Propagating a delta of runnable is not just adding it to ge's
+	 * runnable_sum:
+	 * - Adding a delta runnable can't make ge's runnable_sum higher than
+	 *   LOAD_AVG_MAX
+	 * - We can't directly remove a delta of runnable from
+	 *   ge's runnable_sum but we can only estimate what runnable
+	 *   will become thanks to a few simple rules:
+	 *   - gcfs_rq's runnable is a good estimate
+	 *   - ge's runnable_sum can't increase when we remove runnable
+	 *   - runnable_sum can't be lower than running_sum
+	 */
+	if (runnable_sum >= 0) {
+		runnable_sum += se->avg.load_sum;
+		runnable_sum = min(runnable_sum, LOAD_AVG_MAX);
+	} else
+		runnable_sum = min(se->avg.load_sum, load_sum);
+
+	running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
+	runnable_sum = max(runnable_sum, running_sum);
+
 	load_sum = (s64)se_weight(se) * runnable_sum;
 	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-	add_positive(&se->avg.load_sum, runnable_sum);
-	add_positive(&se->avg.load_avg, load_avg);
+	delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
+	delta_avg = load_avg - se->avg.load_avg;
 
-	add_positive(&cfs_rq->avg.load_avg, load_avg);
-	add_positive(&cfs_rq->avg.load_sum, load_sum);
+	se->avg.load_sum = runnable_sum;
+	se->avg.load_avg = load_avg;
+	add_positive(&cfs_rq->avg.load_avg, delta_avg);
+	add_positive(&cfs_rq->avg.load_sum, delta_sum);
 
 	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
 	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
+	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
 
-	add_positive(&se->avg.runnable_load_sum, runnable_sum);
-	add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+	se->avg.runnable_load_sum = runnable_sum;
+	se->avg.runnable_load_avg = runnable_load_avg;
 
 	if (se->on_rq) {
-		add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
+		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
+		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
 	}
 }
 
-- 
2.7.4