On Mon, 10 Oct, at 07:34:40PM, Vincent Guittot wrote:
> 
> Subject: [PATCH] sched: use load_avg for selecting idlest group
> 
> find_idlest_group() only compares the runnable_load_avg when looking for
> the idlest group. But on fork-intensive use cases like hackbench, where
> tasks block quickly after the fork, this can lead to selecting the same
> CPU again and again, whereas other CPUs, which have a similar runnable
> load but a lower load_avg, could be chosen instead.
> 
> When the runnable_load_avg of two CPUs is close, we now take into account
> the amount of blocked load as a second selection factor.
> 
> For use cases like hackbench, this enables the scheduler to select
> different CPUs during the fork sequence and to spread tasks across the
> system.
> 
> Tests have been done on a Hikey board (ARM-based octa-core) for several
> kernels. The results below give the min, max, avg and stdev values of 18
> runs with each configuration.
> 
> The v4.8+patches configuration also includes the change below, which is
> part of the proposal made by Peter to ensure that the clock is up to date
> when the forked task is attached to the rq.
> 
> @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
>       __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>  #endif
>       rq = __task_rq_lock(p, &rf);
> +     update_rq_clock(rq);
>       post_init_entity_util_avg(&p->se);
>  
>       activate_task(rq, p, 0);
> 
> hackbench -P -g 1 
> 
>        ea86cb4b7621  7dc603c9028e  v4.8        v4.8+patches
> min    0.049         0.050         0.051       0.048
> avg    0.057         0.057(0%)     0.057(0%)   0.055(+5%)
> max    0.066         0.068         0.070       0.063
> stdev  +/-9%         +/-9%         +/-8%       +/-9%
> 
> Signed-off-by: Vincent Guittot <vincent.guit...@linaro.org>
> ---
>  kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++--------
>  1 file changed, 32 insertions(+), 8 deletions(-)
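
For anyone reading along without the patch body (only the diffstat is
quoted above), the selection rule described in the changelog boils down
to something like the standalone sketch below. The structure, the names,
and the IMBALANCE margin are my own illustration, not the actual fair.c
code.

  /*
   * Sketch of the tie-break described in the changelog: prefer the
   * clearly less-loaded group, and when runnable loads are close,
   * fall back to load_avg, which still counts blocked load.
   */
  #include <stdbool.h>
  #include <stdio.h>

  struct group_load {
          unsigned long runnable_load;  /* runnable_load_avg */
          unsigned long avg_load;       /* load_avg, includes blocked load */
  };

  /* Illustrative margin within which two runnable loads count as "close". */
  #define IMBALANCE 2UL

  static bool prefer_as_idlest(const struct group_load *cand,
                               const struct group_load *best)
  {
          /* Clearly less runnable load: pick the candidate outright. */
          if (cand->runnable_load + IMBALANCE < best->runnable_load)
                  return true;

          /*
           * Runnable loads are close: fall back to load_avg, which still
           * counts recently blocked tasks (e.g. freshly forked hackbench
           * children that went to sleep right after the fork).
           */
          return cand->runnable_load < best->runnable_load + IMBALANCE &&
                 cand->avg_load < best->avg_load;
  }

  int main(void)
  {
          /* Equal runnable load, different amounts of blocked load. */
          struct group_load a = { .runnable_load = 4, .avg_load = 40 };
          struct group_load b = { .runnable_load = 4, .avg_load = 10 };

          /* b wins the tie-break despite the equal runnable load. */
          printf("prefer b over a: %d\n", prefer_as_idlest(&b, &a));
          return 0;
  }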

This patch looks pretty good to me, and on this 2-socket 48-cpu Xeon
(domain0 SMT, domain1 MC, domain2 NUMA) it shows a few nice performance
improvements, and no regressions, for various combinations of hackbench
sockets/pipes and group numbers.

But on a 2-socket 8-cpu Xeon (domain0 MC, domain1 DIE) running,

  perf stat --null -r 25 -- hackbench -pipe 30 process 1000

I see a regression,

  baseline: 2.41228
  patched : 2.64528 (-9.7%)

Even though the spread of tasks during fork[0] is improved (CV is the
coefficient of variation of the per-CPU fork counts, so lower means a
more even spread),

  baseline CV: 0.478%
  patched CV : 0.042%

Clearly the spread wasn't *that* bad to begin with on this machine for
this workload. I consider the baseline spread to be pretty well
distributed. Some other factor must be at play.

Patched runqueue latencies are higher (the max9* columns are percentiles),

  baseline: mean: 615932.69 max90: 75272.00 max95: 175985.00
            max99: 5884778.00 max: 1694084747.00
  patched : mean: 882026.28 max90: 92015.00 max95: 291760.00
            max99: 7590167.00 max: 1841154776.00

And there are more migrations of hackbench tasks,

  baseline: total: 5390      cross-MC: 3810      cross-DIE: 1580
  patched : total: 7222      cross-MC: 4591      cross-DIE: 2631
                   (+34.0%)            (+20.5%)             (+66.5%)

That's a lot more of the costly cross-DIE migrations. I think this patch
is along the right lines, but there's something fishy happening on this
box.

[0] - Fork task placement spread measurement:

      # Count how many hackbench tasks were initially placed on each CPU:
      # pick the sched_wakeup_new events out of the trace, keep the target
      # CPU of each, and print one per-CPU placement count per line.
      grep -E "wakeup_new.*comm=hackbench" /tmp/trace.$1 | \
        sed -e 's/.*target_cpu=//' | sort | uniq -c | awk '{print $1}'
