Hi Peter,

On Tuesday 10 Oct 2017 at 09:44:53 (+0200), Vincent Guittot wrote:
> On 10 October 2017 at 09:29, Peter Zijlstra <pet...@infradead.org> wrote:
> > On Mon, Oct 09, 2017 at 05:29:04PM +0200, Vincent Guittot wrote:
> >> On 9 October 2017 at 17:03, Vincent Guittot <vincent.guit...@linaro.org> wrote:
> >> > On 1 September 2017 at 15:21, Peter Zijlstra <pet...@infradead.org> wrote:
> >
> >> >> +/*
> >> >> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we
> >> >> + * need to propagate its contribution. The key to this propagation is
> >> >> + * the invariant that for each group:
> >> >> + *
> >> >> + *   ge->avg == grq->avg                                            (1)
> >> >> + *
> >> >> + * _IFF_ we look at the pure running and runnable sums. Because they
> >> >> + * represent the very same entity, just at different points in the
> >> >> + * hierarchy.
> >> >
> >> > I agree for the running part because only one entity can be running,
> >> > but I'm not sure about the pure runnable sum: we can have several
> >> > runnable tasks in a cfs_rq but only one runnable group entity to
> >> > reflect them, or I misunderstand (1).
> >
> > The idea is that they (ge and grq) are the _same_ entity, just at
> > different levels in the hierarchy. If the grq is runnable, it is through
> > the ge.
> >
> > As a whole, they don't care how many runnable tasks there are.
> 
> Ok so I agree with this point that both grq's and ge's runnable follow
> the same trend.
> 
> My point is that even if the same changes apply to both grq and ge, we
> can't directly apply changes of grq's runnable to ge's runnable.
> 
> >
> >> > As an example, we have 2 always-running tasks TA and TB, so their
> >> > load_sum is LOAD_AVG_MAX for each task. The grq->avg.load_sum = \Sum
> >> > se->avg.load_sum = 2*LOAD_AVG_MAX, but the ge->avg.load_sum will be
> >> > only LOAD_AVG_MAX.
> >> >
> >> > So if we apply d(TB->avg.load_sum) directly to the group hierarchy,
> >> > and to ge->avg.load_sum in particular, the latter decreases to 0
> >> > whereas it should decrease only by half.
> >> >
> >> > I have been able to see this wrong behavior with an rt-app json file.
> >> >
> >> > So I think that we should instead remove only
> >> >
> >> > delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum
> >>
> >> delta = se->avg.load_sum / (grq->avg.load_sum + se->avg.load_sum) * ge->avg.load_sum
> >>
> >> as the se has already been detached.
> >>
> >> > We don't have grq->avg.load_sum but we can have a rough estimate with
> >> > grq->avg.load_avg/grq->weight
> >
> > Hurm, I think I see what you're saying, let me ponder this more.
> 
> The formula above was an example for detaching, but it doesn't apply to
> all use cases. We need a more generic way to propagate runnable changes.
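
To make the TA/TB example above concrete, here is a small standalone
program (plain userspace C, not kernel code; the names and the
LOAD_AVG_MAX value are only illustrative) comparing the naive
propagation with the proportional one:

#include <stdio.h>

#define LOAD_AVG_MAX	47742	/* illustrative maximum PELT sum */

int main(void)
{
	long long tb  = LOAD_AVG_MAX;		/* TB's se->avg.load_sum */
	long long grq = 2 * LOAD_AVG_MAX;	/* grq->avg.load_sum for TA+TB */
	long long ge  = LOAD_AVG_MAX;		/* ge->avg.load_sum, capped */

	/* naive: apply d(TB->avg.load_sum) directly to ge */
	printf("naive:        %lld\n", ge - tb);		/* 0 */

	/* proportional: TB owns only half of grq's runnable */
	printf("proportional: %lld\n", ge - tb * ge / grq);	/* ~LOAD_AVG_MAX/2 */

	return 0;
}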

I have studied a bit more how to improve the propagation formula, and the
changes below do the job for the use cases that I have tested.

Unlike running, we can't directly propagate the runnable sum through the
hierarchy when we migrate a task. Instead, we must make sure not to
over/underestimate the impact of the migration, thanks to several rules
(restated as a standalone sketch after this list):
- ge->avg.runnable_sum can't be higher than LOAD_AVG_MAX
- ge->avg.runnable_sum can't be lower than ge->avg.running_sum (once scaled
  to the same range)
- we can't directly propagate a negative delta of runnable_sum, because part
  of this runnable time can be "shared" with other sched_entities and stay on
  the gcfs_rq. Instead, we can estimate the new runnable_sum of the gcfs_rq
  with the formula: gcfs_rq's runnable_sum = gcfs_rq's load_sum / gcfs_rq's
  weight
- ge->avg.runnable_sum can't increase when we detach a task
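
Restating the rules as a minimal userspace sketch (not the kernel patch
itself; the function name, parameters and LOAD_AVG_MAX value are only
illustrative, and everything is in the same [0, LOAD_AVG_MAX] range):

#include <stdio.h>

#define LOAD_AVG_MAX	47742	/* illustrative maximum PELT sum */

/*
 * @ge_runnable_sum:  current ge->avg.runnable_sum
 * @ge_running_sum:   ge's running_sum, already scaled to the same range
 * @gcfs_rq_estimate: gcfs_rq's load_sum / gcfs_rq's weight
 * @delta:            runnable_sum delta coming from the gcfs_rq
 */
static long clamp_runnable(long ge_runnable_sum, long ge_running_sum,
			   long gcfs_rq_estimate, long delta)
{
	long sum;

	if (delta >= 0) {
		/* attaching: add the delta but never exceed LOAD_AVG_MAX */
		sum = ge_runnable_sum + delta;
		if (sum > LOAD_AVG_MAX)
			sum = LOAD_AVG_MAX;
	} else {
		/*
		 * detaching: don't apply the negative delta directly;
		 * fall back to the gcfs_rq estimate and never increase
		 */
		sum = ge_runnable_sum < gcfs_rq_estimate ?
				ge_runnable_sum : gcfs_rq_estimate;
	}

	/* runnable_sum can never be lower than running_sum */
	return sum > ge_running_sum ? sum : ge_running_sum;
}

int main(void)
{
	/* detach the whole delta while half stays runnable on the gcfs_rq */
	printf("%ld\n", clamp_runnable(LOAD_AVG_MAX, 0,
				       LOAD_AVG_MAX / 2, -LOAD_AVG_MAX));
	return 0;
}

The detach case prints ~LOAD_AVG_MAX/2 instead of 0, which is the behavior
the rules are meant to guarantee.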

---
 kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 45 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350dbec0..a063b048 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3489,33 +3489,67 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-       long runnable_sum = gcfs_rq->prop_runnable_sum;
-       long runnable_load_avg, load_avg;
-       s64 runnable_load_sum, load_sum;
+       long running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+       long runnable_load_avg, delta_avg, load_avg;
+       s64 runnable_load_sum, delta_sum, load_sum = 0;
 
        if (!runnable_sum)
                return;
 
        gcfs_rq->prop_runnable_sum = 0;
 
+       /*
+        * Get a rough estimate of gcfs_rq's runnable.
+        * This is a low guess as it assumes that tasks are equally
+        * runnable, which is not true, but we can't really do better.
+        */
+       if (scale_load_down(gcfs_rq->load.weight))
+               load_sum = div_s64(gcfs_rq->avg.load_sum,
+                               scale_load_down(gcfs_rq->load.weight));
+
+       /*
+        * Propagating a delta of runnable is not just adding it to ge's
+        * runnable_sum:
+        * - Adding a delta runnable can't make ge's runnable_sum higher than
+        *   LOAD_AVG_MAX
+        * - We can't directly remove a delta of runnable from
+        *   ge's runnable_sum; we can only estimate what runnable
+        *   will become thanks to a few simple rules:
+        *   - gcfs_rq's runnable is a good estimate
+        *   - ge's runnable_sum can't increase when we remove runnable
+        *   - runnable_sum can't be lower than running_sum
+        */
+       if (runnable_sum >= 0) {
+               runnable_sum += se->avg.load_sum;
+               runnable_sum = min(runnable_sum, LOAD_AVG_MAX);
+       } else {
+               runnable_sum = min(se->avg.load_sum, load_sum);
+       }
+
+       running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
+       runnable_sum = max(runnable_sum, running_sum);
+
        load_sum = (s64)se_weight(se) * runnable_sum;
        load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-       add_positive(&se->avg.load_sum, runnable_sum);
-       add_positive(&se->avg.load_avg, load_avg);
+       delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
+       delta_avg = load_avg - se->avg.load_avg;
 
-       add_positive(&cfs_rq->avg.load_avg, load_avg);
-       add_positive(&cfs_rq->avg.load_sum, load_sum);
+       se->avg.load_sum = runnable_sum;
+       se->avg.load_avg = load_avg;
+       add_positive(&cfs_rq->avg.load_avg, delta_avg);
+       add_positive(&cfs_rq->avg.load_sum, delta_sum);
 
        runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
        runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+       delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
+       delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
 
-       add_positive(&se->avg.runnable_load_sum, runnable_sum);
-       add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+       se->avg.runnable_load_sum = runnable_sum;
+       se->avg.runnable_load_avg = runnable_load_avg;
 
        if (se->on_rq) {
-               add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
-               add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
+               add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
+               add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
        }
 }
 
-- 
2.7.4
