On 28 April 2017 at 18:14, Tejun Heo <t...@kernel.org> wrote:
> Hello, Vincent.
>
>> The only interest of runnable_load_avg is to be null when a cfs_rq is
>> idle whereas load_avg is not; it is not meant to be higher than
>> load_avg. The root cause is that load_balance only looks at "load" but
>> not at the number of tasks currently running, and that's probably the
>> main problem: runnable_load_avg has been added because load_balance
>> fails to filter out idle groups and idle rqs. We should rather add a
>> new type in group_classify to tag groups that are idle, and do the
>> same in find_busiest_queue.
>
> I'll follow up in the other subthread but there really is a fundamental
> difference in how we calculate runnable_avg w/ and w/o cgroups.
> Independent of whether we can improve the load balancer further, it is
> an existing bug.
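[ Aside: below is a rough, self-contained sketch of the direction
  suggested in the quote above, i.e. tagging groups with no running task
  explicitly instead of relying on their load being null. This is
  ordinary userspace C, not a kernel patch; the group_idle type and the
  field names only loosely mirror kernel/sched/fair.c and are
  illustrative assumptions. ]

/*
 * Toy model: classify a sched group with no running task as "idle" so
 * the balancer could skip it explicitly, instead of relying on
 * runnable_load_avg having decayed to zero.
 */
#include <stdio.h>

enum group_type {
	group_idle,		/* hypothetical new type */
	group_other,
	group_imbalanced,
	group_overloaded,
};

struct sg_lb_stats_model {
	unsigned int sum_nr_running;	/* tasks currently running in the group */
	unsigned long group_load;	/* may still carry decaying blocked load */
	int group_no_capacity;
	int imbalanced;
};

static enum group_type group_classify_model(const struct sg_lb_stats_model *sgs)
{
	if (sgs->group_no_capacity)
		return group_overloaded;

	if (sgs->imbalanced)
		return group_imbalanced;

	/* new: an idle group has no runnable task, whatever its load says */
	if (!sgs->sum_nr_running)
		return group_idle;

	return group_other;
}

int main(void)
{
	struct sg_lb_stats_model blocked_only = { .sum_nr_running = 0, .group_load = 512 };
	struct sg_lb_stats_model busy = { .sum_nr_running = 2, .group_load = 2048 };

	printf("blocked-only group -> %d (expect group_idle)\n",
	       group_classify_model(&blocked_only));
	printf("busy group         -> %d (expect group_other)\n",
	       group_classify_model(&busy));
	return 0;
}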
I'd like to qualify that a bit. runnable_load_avg works correctly as it
is, because it correctly reflects the load of the runnable entities at
the root domain. If you start to propagate runnable_load_avg onto the
load_avg of the group entity, the load will become unstable.
runnable_load_avg has been added to fix load_balance being unable to
select the right busiest rq. So the goal is to rely more and more on
load_avg, not the opposite.

>
>> > * Can you please double check that the higher latencies w/ the patch
>> >   are reliably reproducible?  The test machines that I use have
>> >   variable management load.  They never dominate the machine but are
>> >   enough to disturb the results, so drawing out a reliable pattern
>> >   takes a lot of repeated runs.  I'd really appreciate it if you
>> >   could double check that the pattern is reliable with different run
>> >   patterns (ie. instead of 10 consecutive runs one after another,
>> >   interleaved).
>>
>> I always leave time between 2 consecutive runs, and the 10 consecutive
>> runs take around 2min to execute.
>>
>> Then I have run these 10 consecutive tests several times and the
>> results stay the same.
>
> Can you please try the patch at the end of this message?  I'm
> wondering whether you're seeing the errors accumulating from the wrong
> min().

I still have the regression with the patch below. The regression comes
from using runnable_load_avg as the propagation source. As load_avg
becomes null at some point, it breaks the computation of shares and
load_avg stays very low.

>
>> > ones.  If there's something I can easily buy, please point me to it.
>> > If there's something I can loan, that'd be great too.
>>
>> It looks like most web sites are currently out of stock.
>
> :(
>
>> > * If not, I'll try to clean up the debug patches I have and send them
>> >   your way to get more visibility but given these things tend to be
>> >   very iterative, it might take quite a few back-and-forths.
>>
>> Yes, that could be useful. Even a trace of the regression could be
>> useful.
>
> Yeah, will clean it up a bit and post.  Will also post some traces.
>
> And here's the updated patch.  I didn't add the suggested
> scale_down()'s.  I'll follow up on that in the other subthread.
> Thanks.
>
>  kernel/sched/fair.c |   47 ++++++++++++++++++++---------------------------
>  1 file changed, 20 insertions(+), 27 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3078,37 +3078,30 @@ static inline void
>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> -	long delta, load = gcfs_rq->avg.load_avg;
> +	long load = 0, delta;
>
>  	/*
> -	 * If the load of group cfs_rq is null, the load of the
> -	 * sched_entity will also be null so we can skip the formula
> +	 * A cfs_rq's load avg contribution to the parent should be scaled
> +	 * to the sched_entity's weight.  Use freshly calculated shares
> +	 * instead of @se->load.weight as the latter may not reflect
> +	 * changes from the current scheduling operation.
> +	 *
> +	 * Note that the propagation source is runnable_load_avg instead of
> +	 * load_avg.  This keeps every cfs_rq's runnable_load_avg true to
> +	 * the sum of the scaled loads of all tasks queued under it, which
> +	 * is important for the correct operation of the load balancer.
> +	 *
> +	 * This can make the sched_entity's load_avg jumpier but that
> +	 * correctly reflects what would happen without cgroups if each
> +	 * task's load is scaled across nesting - the load is being
> +	 * averaged at the task and each cfs_rq.
>  	 */
> -	if (load) {
> -		long tg_load;
> +	if (gcfs_rq->load.weight) {
> +		long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
>
> -		/* Get tg's load and ensure tg_load > 0 */
> -		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> -
> -		/* Ensure tg_load >= load and updated with current load*/
> -		tg_load -= gcfs_rq->tg_load_avg_contrib;
> -		tg_load += load;
> -
> -		/*
> -		 * We need to compute a correction term in the case that the
> -		 * task group is consuming more CPU than a task of equal
> -		 * weight. A task with a weight equals to tg->shares will have
> -		 * a load less or equal to scale_load_down(tg->shares).
> -		 * Similarly, the sched_entities that represent the task group
> -		 * at parent level, can't have a load higher than
> -		 * scale_load_down(tg->shares). And the Sum of sched_entities'
> -		 * load must be <= scale_load_down(tg->shares).
> -		 */
> -		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
> -			/* scale gcfs_rq's load into tg's shares*/
> -			load *= scale_load_down(gcfs_rq->tg->shares);
> -			load /= tg_load;
> -		}
> +		load = min_t(long, scale_load_down(shares),
> +			     gcfs_rq->runnable_load_avg *
> +			     shares / gcfs_rq->load.weight);
>  	}
>
>  	delta = load - se->avg.load_avg;
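[ For intuition, here is a small self-contained model of the scaling the
  hunk above ends up doing: the child cfs_rq's runnable_load_avg is
  rescaled by the freshly computed shares and clamped to
  scale_load_down(shares). Plain userspace C, not kernel code; the helper
  name and the numbers are made up. It also shows the behaviour described
  earlier in this mail: once runnable_load_avg has decayed to zero, the
  propagated contribution is zero as well, regardless of any blocked
  load_avg left on the child cfs_rq. ]

/*
 * Toy model of the propagation in the patch above.  The fixed-point
 * shift mirrors the kernel's default, but the concrete numbers are
 * made up for illustration.
 */
#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define scale_load(w)		((long)(w) << SCHED_FIXEDPOINT_SHIFT)
#define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)

/* load = min(scale_load_down(shares), runnable_load_avg * shares / weight) */
static long propagated_load(long runnable_load_avg, long queue_weight, long shares)
{
	long load;

	if (!queue_weight)
		return 0;	/* nothing queued under the group */

	load = runnable_load_avg * shares / queue_weight;
	if (load > scale_load_down(shares))
		load = scale_load_down(shares);
	return load;
}

int main(void)
{
	long shares = scale_load(1024);		/* default tg->shares */
	long weight = scale_load(1024);		/* one nice-0 task queued */

	/* fully runnable task: contribution is the full scaled share */
	printf("%ld\n", propagated_load(1024, weight, shares));

	/*
	 * Runnable load decayed while blocked load may remain: the
	 * propagated contribution follows runnable_load_avg down to 0.
	 */
	printf("%ld\n", propagated_load(64, weight, shares));
	printf("%ld\n", propagated_load(0, weight, shares));

	return 0;
}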