On 2 May 2017 at 22:56, Tejun Heo <t...@kernel.org> wrote: > Hello, Vincent. > > On Tue, May 02, 2017 at 08:56:52AM +0200, Vincent Guittot wrote: >> On 28 April 2017 at 18:14, Tejun Heo <t...@kernel.org> wrote: >> > I'll follow up in the other subthread but there really is fundamental >> > difference in how we calculate runnable_avg w/ and w/o cgroups. >> > Indepndent of whether we can improve the load balancer further, it is >> > an existing bug. >> >> I'd like to weight that a bit. >> The runnable_load_avg works correctly as it is because it reflects >> correctly the load of runnable entity at root domain >> If you start to propagate the runnable_load_avg on the load_avg of the >> group entity, the load will become unstable. >> runnable_load_avg has been added to fix load_balance being unable to >> select the right busiest rq. So the goal is to use more and more >> load_avg not the opposite > > I have a hard time understanding what you're trying to say here. > > Without cgroup, the load balancer uses the sum of load_avgs of the > running tasks on the queue. As shown by the debug trace, the load > balancer repeatedly ends up picking the wrong CPU when cgroup is > involved because it ends up including the load_avgs of nested blocked > tasks into runnable_load_avg of root - we lose the distinction between > running and blocked load_avgs when we pass through a nested cfs_rq. > > We can further improve the load balancer all we want, for example, > right now, we would end up picking a CPU with one task which has a > really high weight over another CPU with two normal weight tasks even, > which isn't ideal; however, there is something obviously broken in the > existing mechanism and we want to fix that first independent of > further improvements, and it won't be a good idea to paper over an > existing problem with a different mechanism either.
That's probably where I disagree. IMO, runnable_load_avg is already a fix to compensate the fact when PELT has been rewritten, the load balance has not been update accordingly and was unable to make the right choice between 2 groups: one group having only 1 task but a higher load and another group with several tasks but a lower load (either because of the blocked load or because a runnable task with far higher load than others). runnable_load_avg has been put back to quickly fix the 1st case but it doesn't fix the other. There were also some issues of propagating load update with task migration but this has been fixed now. So we have 2 use case with the exact same behavior a group with a higher load (runnable_load_avg or load_avg) but only 1 task and another one with several tasks but a lower load. At now runnable_load_avg can only fix one with the impact of breaking load_avg behavior whereas one fix in the load_balance can fix both withour breaking load_avg behavior. We should better work on removing runnable_load_avg completely than adding more stuff on it > >> >> I always let time between 2 consecutive run and the 10 consecutive >> >> runs take around 2min to execute >> >> >> >> Then I have run several time these 10 consecutive test and results stay >> >> the same >> > >> > Can you please try the patch at the end of this message? I'm >> > wondering whether you're seeing the errors accumulating from the wrong >> > min(). >> >> I still have the regression with the patch below. >> The regression comes from the use runnable_load_avg to propagate. As >> load_avg becomes null at some point, it break the computation of share >> and the load_avg stay very low > > That's surprising given that what the patch does is bringing the > cgroup behavior closer to !cgroup behavior. It'd be great to be able But it's also break load_avg which is used to calculate per cfs_rq share and this change generates the regression on my board > to reproduce the problem and trace it. It looks like the board is > pretty standardized. Would the following be equivalent to the one you > have? > > http://a.co/f3dD1lm Yes it's this board i have the previous version but it should not change the behavior > > If so, I can just buy it, get your test image and repro it here and > trace why the regression is happening with the setup. We might be > hitting something else. I have pushed my branch which includes v4.11-rc8 + your patches + some debug trace here: https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=test-tejun-patches and uploaded a trace_cmd's trace that shows the regression http://people.linaro.org/~vincent.guittot/trace-tejun-patches As an example, we can easily see arounf time stamp 94.826469 sec that cpu5 is idle while CPU2 and CPU3 share their time between several task. This doesn't happens without your patch > > Thanks. > > -- > tejun