Hi Matt,

On 23 September 2016 at 13:58, Matt Fleming <m...@codeblueprint.co.uk> wrote:
> Since commit 7dc603c9028e ("sched/fair: Fix PELT integrity for new
> tasks") ::last_update_time will be set to a non-zero value in
> post_init_entity_util_avg(), which leads to p->se.avg.load_avg being
> decayed on enqueue before the task has even had a chance to run.
>
> For a NICE_0 task the sequence of events leading up to this with
> example load average changes might be,
>
>   sched_fork()
>     init_entity_runnable_average()
>       p->se.avg.load_avg = scale_load_down(se->load.weight);  // 1024
>
>   wake_up_new_task()
>     post_init_entity_util_avg()
>       attach_entity_load_avg()
>         p->se.last_update_time = cfs_rq->avg.last_update_time;
>
>     activate_task()
>       enqueue_task()
>         ...
>           enqueue_entity_load_avg()
>             migrated = !sa->last_update_time // false
>             if (!migrated)
>               __update_load_avg()
>                 p->se.avg.load_avg = 1002

Does this mean that the perf drop you mention below is caused by the
load being decayed to 1002 instead of staying at 1024?

The 1002 mainly comes from period_contrib being set to 1023 in
init_entity_runnable_average(), so any delay longer than 1us between
attach_entity_load_avg() and enqueue_entity_load_avg() will trigger the
decay of the load from 1024 to 1002.

>
> This causes a performance regression for fork intensive workloads like
> hackbench. When balancing on fork we can end up picking the same CPU
> to enqueue on over and over. This leads to huge congestion when trying
> to simultaneously wake up tasks that are all on the same runqueue, and
> causes lots of migrations on wake up.
>
> The behaviour since commit 7dc603c9028e essentially defeats the
> scheduler's attempt to balance on fork(). Before, ::runnable_load_avg
> likely had a non-zero value when the hackbench tasks were dequeued
> (the fork()'d tasks immediately block reading on pipe/socket) but now
> the load balancer sees the CPU as having no runnable load.

But this patch doesn't change the behavior of runnable_load_avg, does
it? It only affects the initial value of p->se.avg.load_avg when the
task is enqueued.

>
> Arguably the real problem is that balancing on fork doesn't look at
> the blocked contribution of tasks, only the runnable load and it's
> possible for the two metrics to be wildly different on a relatively
> idle system.

Fair enough.

>
> But it still doesn't seem quite right to update a task's load_avg
> before it runs for the first time.
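
To make the size of that first update concrete, here is a small
user-space sketch of the arithmetic. It is a simplification, not the
kernel code: the helper names are hypothetical, the decay is applied
directly to load_avg rather than to load_sum, and the constant is
assumed to be the 32-bit fixed-point value of y, where y^32 = 0.5.

```c
#include <assert.h>
#include <stdint.h>

/* Length of one PELT period in microseconds (1024 us, roughly 1 ms). */
#define PELT_PERIOD_US 1024u

/*
 * y in 32-bit fixed point, where y^32 = 0.5 (y is about 0.97857).
 * Assumption: hard-coded here for illustration only.
 */
#define PELT_Y_FIXED 0xfa83b2daULL

/* Hypothetical helper: full periods crossed once delta_us has elapsed. */
static unsigned int periods_crossed(unsigned int period_contrib,
                                    unsigned int delta_us)
{
    return (period_contrib + delta_us) / PELT_PERIOD_US;
}

/* Hypothetical helper: decay val by y^n, one period at a time. */
static uint64_t decay_load(uint64_t val, unsigned int n)
{
    while (n--)
        val = (val * PELT_Y_FIXED) >> 32;
    return val;
}

/* The scenario from this thread: period_contrib = 1023, 1us of delay. */
static void example(void)
{
    unsigned int n = periods_crossed(1023, 1);

    assert(n == 1);                       /* one period boundary crossed */
    assert(decay_load(1024, n) == 1002);  /* 1024 * y, truncated */
}
```

With period_contrib starting at 1023, a single microsecond is enough to
cross a period boundary, and one period of decay takes a NICE_0 load of
1024 down to 1002.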
>
> Here are the results of running hackbench before 7dc603c9028e (old
> behaviour), with 7dc603c9028e applied (exiting behaviour), and after
> 7dc603c9028e with this patch on top (new behaviour),
>
> hackbench-process-sockets
>
>                         4.7.0-rc5             4.7.0-rc5             4.7.0-rc5
>                            before          7dc603c9028e                 after
> Amean    1        0.0611 (  0.00%)      0.0693 (-13.32%)      0.0600 (  1.87%)
> Amean    4        0.1777 (  0.00%)      0.1730 (  2.65%)      0.1790 ( -0.72%)
> Amean    7        0.2771 (  0.00%)      0.2816 ( -1.60%)      0.2741 (  1.08%)
> Amean    12       0.3851 (  0.00%)      0.4167 ( -8.20%)      0.3751 (  2.60%)
>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Ingo Molnar <mi...@kernel.org>
> Cc: Mike Galbraith <umgwanakikb...@gmail.com>
> Cc: Yuyang Du <yuyang...@intel.com>
> Cc: Vincent Guittot <vincent.guit...@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggem...@arm.com>
> Signed-off-by: Matt Fleming <m...@codeblueprint.co.uk>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8fb4d1942c14..4a2d3ff772f8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3142,7 +3142,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>         int migrated, decayed;
>
>         migrated = !sa->last_update_time;
> -       if (!migrated) {
> +       if (!migrated && se->sum_exec_runtime) {
>                 __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
>                         se->on_rq * scale_load_down(se->load.weight),
>                         cfs_rq->curr == se, NULL);
> --
> 2.10.0
>
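
For reference, the effect of the one-line change on the enqueue
decision can be modelled in user space as below. This is only a sketch:
struct entity_model and both function names are hypothetical stand-ins,
and only the two fields the condition reads are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for the fields the check reads; not the kernel structs. */
struct entity_model {
    uint64_t last_update_time;  /* 0 is treated as "migrated" */
    uint64_t sum_exec_runtime;  /* 0 until the task has run at least once */
};

/* Condition before the patch: update unless the entity was migrated. */
static bool updates_load_old(const struct entity_model *se)
{
    bool migrated = !se->last_update_time;

    return !migrated;
}

/* Condition with the patch: additionally require that the task has run. */
static bool updates_load_new(const struct entity_model *se)
{
    bool migrated = !se->last_update_time;

    return !migrated && se->sum_exec_runtime;
}

/* A freshly forked task: attached (non-zero timestamp) but never run. */
static void example(void)
{
    struct entity_model fresh = { .last_update_time = 100,
                                  .sum_exec_runtime = 0 };

    assert(updates_load_old(&fresh));   /* load_avg is decayed on enqueue */
    assert(!updates_load_new(&fresh));  /* decay is skipped until it runs */
}
```

A task that has already run (non-zero sum_exec_runtime) takes the update
path under both conditions, so only the first enqueue after fork is
affected in this model.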