On 28 April 2017 at 22:33, Tejun Heo <t...@kernel.org> wrote:
> Hello, Vincent.
>
> On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
>> On 27 April 2017 at 00:52, Tejun Heo <t...@kernel.org> wrote:
>> > Hello,
>> >
>> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
>> >> On 24 April 2017 at 22:14, Tejun Heo <t...@kernel.org> wrote:
>> >> Can the problem be on the load balance side instead ? and more
>> >> precisely in the wakeup path ?
>> >> After looking at the trace, it seems that task placement happens at
>> >> wake up path and if it fails to select the right idle cpu at wake up,
>> >> you will have to wait for a load balance which is already too late
>> >
>> > Oh, I was tracing most of scheduler activities and the ratios of
>> > wakeups picking idle CPUs were about the same regardless of cgroup
>> > membership.  I can confidently say that the latency issue that I'm
>> > seeing is from load balancer picking the wrong busiest CPU, which is
>> > not to say that there can be other problems.
>>
>> ok. Is there any trace that you can share ? your behavior seems
>> different from mine
>
> I'm attaching the debug patch.  With your change (avg instead of
> runnable_avg), the following trace shows why it's wrong.
>
> It's dumping a case where group A has a CPU w/ more than two schbench
> threads and B doesn't, but the load balancer is determining that B is
> loaded heavier.
>
> dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
> dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
> dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
> dbg_odd_dump: A: CPU001: run=1 schb=1
> dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
> dbg_odd_dump: A: Q001-asdf: schbench(153757C):w=1.000,l=0.527,u=0.514
> dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
> dbg_odd_dump: A: Q001-/: asdf(C):w=5.744,l=3.017,u=0.521
> dbg_odd_dump: A: CPU017: run=2 schb=2
> dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
> dbg_odd_dump: A: Q017-asdf: schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
> dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
> dbg_odd_dump: A: Q017-/: asdf(C):w=10.653,l=5.269,u=0.966
> dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
> dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
> dbg_odd_dump: B: CPU014: run=1 schb=1
> dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
> dbg_odd_dump: B: Q014-asdf: schbench(153760C):w=1.000,l=0.491,u=0.476
> dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
> dbg_odd_dump: B: Q014-/: asdf(C):w=5.605,l=5.766,u=0.970
> dbg_odd_dump: B: CPU030: run=1 schb=1
> dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
> dbg_odd_dump: B: Q030-asdf: schbench(153747C):w=1.000,l=0.537,u=0.516
> dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
> dbg_odd_dump: B: Q030-/: asdf(C):w=5.758,l=3.092,u=0.516
>
> You can notice that B's pertask weight is 4.409 which is way higher
> than A's 2.779, and this is because Q014-asdf's contribution to Q014-/ is
> twice as high as it should be.  The root queue's runnable avg should

Are you sure that this is because of blocked load in group A? It could
be that Q014-asdf has already had to wait before running, and its load
keeps increasing while it is runnable but not running.

IIUC your trace, group A has 2 running tasks and group B only one, but
load_balance selects B because its sgs->avg_load is higher. But this
can also happen even if the runnable_load_avg of the child cfs_rq is
propagated correctly into the group entity: we can have a situation
where group A has only 1 task with a higher load than the 2 tasks on
group B, even when blocked load is not taken into account, and
load_balance will select A.

IMHO, we should rather improve load balance selection. I'm going to
add smarter group selection in load_balance. That's something we
should have done already, but it was difficult without load/util_avg
propagation. It should be doable now.

> only contain what's currently active but because we're scaling load
> avg which includes both active and blocked, we're ending up picking
> group B over A.
>
> This shows up in the total number of times we pick the wrong queue and
> thus latency.  I'm running the following script with the debug patch
> applied.
>
> #!/bin/bash
>
> date
> cat /proc/self/cgroup
>
> echo 1000 > /sys/module/fair/parameters/dbg_odd_nth
> echo 0 > /sys/module/fair/parameters/dbg_odd_cnt
>
> ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>
> cat /sys/module/fair/parameters/dbg_odd_cnt
>
>
> With your patch applied, in the root cgroup,
>
> Fri Apr 28 12:48:59 PDT 2017
> 0::/
> Latency percentiles (usec)
> 50.0000th: 26
> 75.0000th: 63
> 90.0000th: 78
> 95.0000th: 88
> *99.0000th: 707
> 99.5000th: 5096
> 99.9000th: 10352
> min=0, max=13743
> 577
>
>
> In the /asdf cgroup,
>
> Fri Apr 28 13:19:53 PDT 2017
> 0::/asdf
> Latency percentiles (usec)
> 50.0000th: 35
> 75.0000th: 67
> 90.0000th: 81
> 95.0000th: 98
> *99.0000th: 2212
> 99.5000th: 4536
> 99.9000th: 11024
> min=0, max=13026
> 1708
>
>
> The last line is the number of times the load balancer picked a group
> w/o more than two schbench threads on a CPU over one w/.  Some number
> of these are expected as there are other threads and there are some
> plays in all the calculations but propagating avg or not propagating at
> all significantly increases the count and latency.
>
>> > The issue isn't about whether runnable_load_avg or load_avg should be
>> > used but the unexpected differences in the metrics that the load
>>
>> I think that's the root of the problem. I explain a bit more my view
>> on the other thread
>
> So, when picking the busiest group, the only thing which matters is
> the queue's runnable_load_avg, which should approximate the sum of all
> on-queue loads on that CPU.
>
> If we don't propagate or propagate load_avg, we're factoring in
> blocked avg of descendent cgroups into the root's runnable_load_avg
> which is obviously wrong.
>
> We can argue whether overriding a cfs_rq se's load_avg to the scaled
> runnable_load_avg of the cfs_rq is the right way to go or we should
> introduce a separate channel to propagate runnable_load_avg; however,
> it's clear that we need to fix runnable_load_avg propagation one way
> or another.
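
To make the quoted point about blocked load concrete, here is a minimal
sketch in Python (an illustration only, not kernel code). It assumes the
group entity's load is roughly the child cfs_rq's load scaled by the
entity weight over the cfs_rq weight; that relation is inferred from the
numbers in the dump, not taken from the patch, and propagated_se_load and
its parameters are hypothetical names. The inputs are the Q014 values from
the trace quoted above.

```python
def propagated_se_load(se_weight, cfs_rq_weight, cfs_rq_load):
    """Scale a child cfs_rq's load into its group entity (assumed model)."""
    if cfs_rq_weight <= 0:
        return 0.0
    return se_weight * cfs_rq_load / cfs_rq_weight


# Values copied from the dump for CPU 14:
#   Q014-asdf: w=1.000, l=1.004, r=0.492
#   Q014-/ asdf(C): w=5.605
SE_WEIGHT = 5.605
CFS_RQ_WEIGHT = 1.000
CFS_RQ_LOAD_AVG = 1.004            # runnable + blocked
CFS_RQ_RUNNABLE_LOAD_AVG = 0.492   # runnable only

# Propagating load_avg carries the blocked contribution into the parent.
with_blocked = propagated_se_load(SE_WEIGHT, CFS_RQ_WEIGHT, CFS_RQ_LOAD_AVG)

# Propagating runnable_load_avg carries only what is currently on the queue.
runnable_only = propagated_se_load(
    SE_WEIGHT, CFS_RQ_WEIGHT, CFS_RQ_RUNNABLE_LOAD_AVG
)

inflation = with_blocked / runnable_only

print(f"with blocked load: {with_blocked:.3f}")
print(f"runnable only:     {runnable_only:.3f}")
print(f"inflation factor:  {inflation:.2f}")
```

Under that assumed model the two results differ by about a factor of two,
which is in line with the "twice as high as it should be" remark in the
quoted text.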

The minimum would be to not break load_avg.

>
> The thing with cfs_rq se's load_avg is that, it isn't really used
> anywhere else AFAICS, so overriding it to the cfs_rq's
> runnable_load_avg isn't prettiest but doesn't really change anything.

load_avg is used for defining the share of each cfs_rq.

>
> Thanks.
>
> --
> tejun
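
For the group selection question discussed above, a rough sketch in
Python of how the per-group figures in the dump relate to each other. It
assumes that grp= is the group load, gcap= the group capacity and run= the
number of running tasks, and that the dump prints values already
normalized by the capacity scale; GroupStats is a hypothetical helper for
illustration, not the kernel's struct sg_lb_stats.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    name: str
    group_load: float      # "grp=" / "sum=" in the dump (assumed mapping)
    group_capacity: float  # "gcap="
    nr_running: int        # "run="

    @property
    def avg_load(self) -> float:
        # Load relative to capacity; close to "avg=" in the dump.
        return self.group_load / self.group_capacity

    @property
    def load_per_task(self) -> float:
        # Load divided by running tasks; close to "pertask=" in the dump.
        return self.group_load / self.nr_running


group_a = GroupStats("A", group_load=8.337, group_capacity=1.150, nr_running=3)
group_b = GroupStats("B", group_load=8.819, group_capacity=1.150, nr_running=2)
groups = [group_a, group_b]

# Selecting on avg_load alone prefers B even though A runs more tasks.
busiest_by_avg_load = max(groups, key=lambda g: g.avg_load)
busiest_by_nr_running = max(groups, key=lambda g: g.nr_running)

print("busiest by avg_load:  ", busiest_by_avg_load.name)
print("busiest by nr_running:", busiest_by_nr_running.name)
```

With the values from the dump, the two criteria disagree, which is the
case a smarter group selection would have to handle.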