On 28 September 2016 at 04:46, Vincent Guittot <vincent.guit...@linaro.org> wrote:
> On 28 September 2016 at 04:31, Dietmar Eggemann <dietmar.eggem...@arm.com> wrote:
>> On 28/09/16 12:19, Peter Zijlstra wrote:
>>> On Wed, Sep 28, 2016 at 12:06:43PM +0100, Dietmar Eggemann wrote:
>>>> On 28/09/16 11:14, Peter Zijlstra wrote:
>>>>> On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:
>>
>> [...]
>>
>>>> Not sure what you mean by 'after fixing' but the se is initialized with
>>>> a possibly stale 'now' value in post_init_entity_util_avg()->
>>>> attach_entity_load_avg() before the clock is updated in
>>>> activate_task()->enqueue_task().
>>>
>>> I meant that after I fix the above issue of calling post_init with a
>>> stale clock. So the + update_rq_clock(rq) in the patch.
>>
>> OK.
>>
>> [...]
>>
>>>>> While staring at this, I don't think we can still hit
>>>>> vruntime_normalized() with a new task, so I _think_ we can remove that
>>>>> !se->sum_exec_runtime clause there (and rejoice), no?
>>>>
>>>> I'm afraid that with accurate timing we will get the same situation that
>>>> we add and subtract the same amount of load (probably 1024 now and not
>>>> 1002 (or less)) to/from cfs_rq->runnable_load_avg for the initial (fork)
>>>> hackbench run.
>>>> After all, it's 'runnable' based.
>>>
>>> The idea was that since we now update rq clock before post_init and then
>>> leave it be, both post_init and enqueue see the exact same timestamp,
>>> and the delta is 0, resulting in no aging.
>>>
>>> Or did I fail to make that happen?
>>
>> No, but IMHO what Matt wants is ageing for the hackbench tasks at the end
>> of their fork phase so there is a tiny amount of
>> cfs_rq->runnable_load_avg left on cpuX after the fork related dequeue so
>> the (load-based) fork-balancer chooses cpuY for the next hackbench task.
>> That's why he wanted to avoid the __update_load_avg(se) on enqueue (thus
>> adding 1024 to cfs_rq->runnable_load_avg) and do the ageing only on
>> dequeue (removing <1024 from cfs_rq->runnable_load_avg).
>
> OK, so I'm a bit confused there.
> My understanding of your explanation above is that we now leave a small
> amount of load in runnable_load_avg after the dequeue, so another cpu
> will be chosen. But this explanation seems to be the opposite of what
> Matt said in a previous email:
> "The performance drop comes from the fact that enqueueing/dequeueing a
> task with load 1002 during fork() results in a zero runnable_load_avg,
> which signals to the load balancer that the CPU is idle, so the next
> time we fork() we'll pick the same CPU to enqueue on -- and the cycle
> continues."

Sorry, forget my question; I just misread your explanation.

Matt,

Maybe you can try this patch, which uses utilization in
find_idlest_group. So even if runnable_load_avg is zero, the utilization
should not be, and another cpu will be chosen:
https://patchwork.kernel.org/patch/9306939/
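
To make the arithmetic discussed above concrete, here is a rough toy model of a single enqueue/dequeue pair for a freshly forked task. This is only a sketch, not kernel code: toy_decay(), toy_residue() and the TOY_* constants are made-up names, the decay factor is assumed to be y * 2^32 with y^32 = 1/2, and the model ignores the ageing of the cfs_rq's own sum, so treat the numbers as illustrative only.

```c
#include <stdint.h>

/* Assumption: fixed-point approximation of y * 2^32, where y^32 = 1/2. */
#define TOY_Y_INV     0xfa83b2daULL
/* Assumption: load of a fresh nice-0 task that has not been aged yet. */
#define TOY_FULL_LOAD 1024ULL

/* Age a load value by a number of whole decay periods. */
static uint64_t toy_decay(uint64_t load, unsigned int periods)
{
	while (periods--)
		load = (load * TOY_Y_INV) >> 32;
	return load;
}

/*
 * Runnable load left on the toy runqueue after one enqueue/dequeue pair.
 * age_before_enqueue: periods the entity is aged before it is added
 * age_before_dequeue: further periods it is aged before it is removed
 */
static uint64_t toy_residue(unsigned int age_before_enqueue,
			    unsigned int age_before_dequeue)
{
	uint64_t se_load = toy_decay(TOY_FULL_LOAD, age_before_enqueue);
	uint64_t rq_load = 0;

	rq_load += se_load;				/* enqueue */
	se_load = toy_decay(se_load, age_before_dequeue);
	rq_load -= se_load;				/* dequeue */

	return rq_load;
}
```

In this model, ageing by one period before the enqueue (toy_residue(1, 0)) adds and removes 1002, and a delta of 0 everywhere (toy_residue(0, 0)) adds and removes 1024; both leave nothing behind. Only ageing between enqueue and dequeue (toy_residue(0, 1)) leaves a small residue, 1024 - 1002 = 22, which is the case Dietmar describes.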