On Mon, 10 Oct, at 07:34:40PM, Vincent Guittot wrote:
>
> Subject: [PATCH] sched: use load_avg for selecting idlest group
>
> find_idlest_group() only compares the runnable_load_avg when looking for
> the idlest group. But on fork-intensive use cases like hackbench, where
> tasks block quickly after the fork, this can lead to selecting the same
> CPU whereas other CPUs, which have a similar runnable load but a lower
> load_avg, could be chosen instead.
>
> When the runnable_load_avg of 2 CPUs are close, we now take into account
> the amount of blocked load as a 2nd selection factor.
>
> For use cases like hackbench, this enables the scheduler to select
> different CPUs during the fork sequence and to spread tasks across the
> system.
>
> Tests have been done on a Hikey board (ARM-based octo cores) for several
> kernels. The result below gives min, max, avg and stdev values of 18 runs
> with each configuration.
>
> The v4.8+patches configuration also includes the change below, which is
> part of the proposal made by Peter to ensure that the clock will be up to
> date when the forked task is attached to the rq.
>
> @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
>  	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>  #endif
>  	rq = __task_rq_lock(p, &rf);
> +	update_rq_clock(rq);
>  	post_init_entity_util_avg(&p->se);
>
>  	activate_task(rq, p, 0);
>
> hackbench -P -g 1
>
>        ea86cb4b7621  7dc603c9028e  v4.8       v4.8+patches
> min    0.049         0.050         0.051      0.048
> avg    0.057         0.057(0%)     0.057(0%)  0.055(+5%)
> max    0.066         0.068         0.070      0.063
> stdev  +/-9%         +/-9%         +/-8%      +/-9%
>
> Signed-off-by: Vincent Guittot <vincent.guit...@linaro.org>
> ---
>  kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++--------
>  1 file changed, 32 insertions(+), 8 deletions(-)
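To make sure I'm reading the intent right, the tie-break described above
amounts to roughly the following. This is a standalone sketch with made-up
names (prefer_group, IMBALANCE_SCALE, struct group_load), not the actual
hunk from the patch:

  #include <stdbool.h>
  #include <stdio.h>

  /* Illustrative margin: runnable loads within ~6% count as "close". */
  #define IMBALANCE_SCALE 17	/* ~17/16, i.e. 6.25% */

  struct group_load {
  	unsigned long runnable_load;	/* load of currently runnable tasks */
  	unsigned long avg_load;		/* also counts blocked (recently slept) tasks */
  };

  /*
   * Return true if 'candidate' should be preferred over 'best' as the
   * idlest group: primarily by runnable load, but when the runnable loads
   * are close, fall back to the load that also includes blocked tasks, so
   * freshly forked-then-slept tasks still weigh a CPU down for the next
   * placement decision.
   */
  static bool prefer_group(const struct group_load *candidate,
  			 const struct group_load *best)
  {
  	if (candidate->runnable_load * IMBALANCE_SCALE < best->runnable_load * 16)
  		return true;	/* clearly less runnable load */

  	if (candidate->runnable_load * 16 > best->runnable_load * IMBALANCE_SCALE)
  		return false;	/* clearly more runnable load */

  	/* Runnable loads are close: use blocked load as the 2nd factor. */
  	return candidate->avg_load < best->avg_load;
  }

  int main(void)
  {
  	struct group_load a = { .runnable_load = 1024, .avg_load = 3072 };
  	struct group_load b = { .runnable_load = 1040, .avg_load = 1024 };

  	/* b wins: runnable loads are within the margin, but b has less blocked load. */
  	printf("prefer b over a: %d\n", prefer_group(&b, &a));
  	return 0;
  }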
This patch looks pretty good to me, and this 2-socket 48-cpu Xeon (domain0
SMT, domain1 MC, domain2 NUMA) shows a few nice performance improvements and
no regressions for various combinations of hackbench sockets/pipes and group
numbers.

But on a 2-socket 8-cpu Xeon (domain0 MC, domain1 DIE) running,

  perf stat --null -r 25 -- hackbench -pipe 30 process 1000

I see a regression,

  baseline: 2.41228
  patched : 2.64528 (-9.7%)

Even though the spread of tasks during fork[0] is improved,

  baseline CV: 0.478%
  patched  CV: 0.042%

Clearly the spread wasn't *that* bad to begin with on this machine for this
workload; I consider the baseline spread to be pretty well distributed, so
some other factor must be at play.

Patched runqueue latencies are higher (max9* are percentiles),

  baseline: mean: 615932.69  max90: 75272.00  max95: 175985.00  max99: 5884778.00  max: 1694084747.00
  patched : mean: 882026.28  max90: 92015.00  max95: 291760.00  max99: 7590167.00  max: 1841154776.00

And there are more migrations of hackbench tasks,

  baseline: total: 5390   cross-MC: 3810   cross-DIE: 1580
  patched : total: 7222   cross-MC: 4591   cross-DIE: 2631
                 (+34.0%)        (+20.5%)         (+66.5%)

That's a lot more of the costly cross-DIE migrations. I think this patch is
along the right lines, but there's something fishy happening on this box.

[0] - Fork task placement spread measurement:

  cat /tmp/trace.$1 | grep -E "wakeup_new.*comm=hackbench" | \
	sed -e 's/.*target_cpu=//' | sort | uniq -c | awk '{print $1}'
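For completeness, the CV figures above are just stddev/mean over the per-CPU
counts printed by the pipeline in [0] (a smaller CV means fork placements are
spread more evenly). Something like this standalone sketch, reading the
counts on stdin, computes it; it's an illustration of the arithmetic, not the
exact tool I used:

  /* build: cc cv.c -o cv -lm ; run: ./cv < per_cpu_counts.txt */
  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
  	double x, sum = 0.0, sumsq = 0.0;
  	long n = 0;

  	/* One wakeup_new count per line, one line per target CPU. */
  	while (scanf("%lf", &x) == 1) {
  		sum += x;
  		sumsq += x * x;
  		n++;
  	}

  	if (n == 0)
  		return 1;

  	double mean = sum / n;
  	double var = sumsq / n - mean * mean;	/* population variance */

  	printf("CV: %.3f%%\n", 100.0 * sqrt(var > 0 ? var : 0) / mean);
  	return 0;
  }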