Hello, Vincent.

On Thu, Apr 27, 2017 at 10:28:01AM +0200, Vincent Guittot wrote:
> On 27 April 2017 at 02:30, Tejun Heo <t...@kernel.org> wrote:
> > Hello, Vincent.
> >
> > On Wed, Apr 26, 2017 at 12:21:52PM +0200, Vincent Guittot wrote:
> >> > This is from the follow-up patch.  I was confused.  Because we don't
> >> > propagate decays, we still should decay the runnable_load_avg;
> >> > otherwise, we end up accumulating errors in the counter.  I'll drop
> >> > the last patch.
> >>
> >> Ok, the runnable_load_avg goes back to 0 when I drop patch 3.  But I
> >> see runnable_load_avg sometimes significantly higher than load_avg,
> >> which is normally not possible as load_avg = runnable_load_avg +
> >> sleeping tasks' load_avg.
> >
> > So, while load_avg would eventually converge on runnable_load_avg +
> > blocked load_avg given a stable enough workload for long enough,
> > runnable_load_avg jumping above load_avg temporarily is expected,
>
> No, it's not.  Look at load_avg/runnable_avg at the root domain when
> only tasks are involved; runnable_load_avg will never be higher than
> load_avg because
>
>   load_avg = /sum load_avg of tasks attached to the cfs_rq
>   runnable_load_avg = /sum load_avg of tasks attached and enqueued
>                       to the cfs_rq
>
> load_avg = runnable_load_avg + blocked tasks' load_avg, and as a
> result runnable_load_avg is always lower than or equal to load_avg.
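As an aside, the relation spelled out above can be checked with a toy
userspace model (not kernel code; struct toy_task and all the names
below are made up for illustration): load_avg sums every attached
task's contribution while runnable_load_avg only sums the enqueued
ones, so the latter can never exceed the former.

	/* Toy userspace model of the invariant above - not kernel code. */
	#include <assert.h>
	#include <stdbool.h>
	#include <stdio.h>

	struct toy_task {
		long load_avg;  /* per-task load average */
		bool enqueued;  /* currently enqueued on the rq? */
	};

	int main(void)
	{
		struct toy_task tasks[] = {
			{ .load_avg = 1024, .enqueued = true  },
			{ .load_avg =  512, .enqueued = false }, /* blocked */
			{ .load_avg =  256, .enqueued = true  },
		};
		long load_avg = 0, runnable_load_avg = 0;

		for (size_t i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++) {
			load_avg += tasks[i].load_avg;	/* all attached tasks */
			if (tasks[i].enqueued)		/* enqueued tasks only */
				runnable_load_avg += tasks[i].load_avg;
		}

		printf("load_avg=%ld runnable_load_avg=%ld\n",
		       load_avg, runnable_load_avg);
		/* load_avg = runnable_load_avg + blocked load, so this holds. */
		assert(runnable_load_avg <= load_avg);
		return 0;
	}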
We don't actually calculate load_avg and runnable_load_avg that way in
the kernel, but yeah, given the calculations we do, runnable_load_avg
shouldn't be significantly higher than load_avg.  I do see runnable
going above load_avg by a bit fairly often, but the margins are small.
I wonder whether you're seeing the errors accumulating from the
incorrect min() capping.

> And with the propagate load/util_avg patchset, we can even reflect
> task migration directly at the root domain, whereas before we had to
> wait for the util/load_avg and runnable_load_avg to converge to the
> new value.
>
> Just to confirm one of my assumptions: the latency regression was
> already there in previous kernel versions and is not a result of the
> propagate load/util_avg patchset, isn't it?

Yeah, the latency problem is independent of the propagation logic.
It's really the meaning of the top level runnable_load_avg being
different w/ and w/o cgroups.

> > AFAICS.  That's the whole point of it, a sum closely tracking
> > what's currently on the cpu so that we can pick the cpu which has
> > the most on it now.  It doesn't make sense to try to pick threads
> > off of a cpu which is generally loaded but doesn't have much going
> > on right now, after all.
>
> The only point of runnable_load_avg is to be null when a cfs_rq is
> idle whereas load_avg is not, not to be higher than load_avg.  The
> root cause is that load_balance only looks at "load" but not the
> number of tasks currently running, and that's probably the main
> problem: runnable_load_avg was added because load_balance fails to
> filter idle groups and idle rqs.  We should instead add a new type in
> group_classify to tag groups that are idle, and do the same in
> find_busiest_queue.

I'll follow up in the other subthread, but there really is a
fundamental difference in how we calculate runnable_avg w/ and w/o
cgroups.  Independent of whether we can improve the load balancer
further, it is an existing bug.

> > * Can you please double check that the higher latencies w/ the
> >   patch are reliably reproducible?  The test machines that I use
> >   have variable management load.  They never dominate the machine
> >   but are enough to disturb the results, so drawing out a reliable
> >   pattern takes a lot of repeated runs.  I'd really appreciate it if
> >   you could double check that the pattern is reliable with different
> >   run patterns (ie. instead of 10 consecutive runs after another,
> >   interleaved).
>
> I always leave time between 2 consecutive runs, and the 10
> consecutive runs take around 2min to execute.
>
> Then I have run these 10 consecutive tests several times and the
> results stay the same.

Can you please try the patch at the end of this message?  I'm
wondering whether you're seeing the errors accumulating from the
wrong min().

> > ones.  If there's something I can easily buy, please point me to
> > it.  If there's something I can loan, that'd be great too.
>
> It looks like most of the web sites are currently out of stock :(

> > * If not, I'll try to clean up the debug patches I have and send
> >   them your way to get more visibility, but given these things tend
> >   to be very iterative, it might take quite a bit of back and forth.
>
> Yes, that could be useful.  Even a trace of the regression could be
> useful.

Yeah, will clean it up a bit and post.  Will also post some traces.

And here's the updated patch.  I didn't add the suggested
scale_down()'s.  I'll follow up on that in the other subthread.

Thanks.
 kernel/sched/fair.c | 47 ++++++++++++++++++++---------------------------
 1 file changed, 20 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3078,37 +3078,30 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
+	long load = 0, delta;
 
 	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
+	 * A cfs_rq's load avg contribution to the parent should be scaled
+	 * to the sched_entity's weight.  Use freshly calculated shares
+	 * instead of @se->load.weight as the latter may not reflect
+	 * changes from the current scheduling operation.
+	 *
+	 * Note that the propagation source is runnable_load_avg instead of
+	 * load_avg.  This keeps every cfs_rq's runnable_load_avg true to
+	 * the sum of the scaled loads of all tasks queued under it, which
+	 * is important for the correct operation of the load balancer.
+	 *
+	 * This can make the sched_entity's load_avg jumpier but that
+	 * correctly reflects what would happen without cgroups if each
+	 * task's load is scaled across nesting - the load is being
+	 * averaged at the task and each cfs_rq.
	 */
-	if (load) {
-		long tg_load;
+	if (gcfs_rq->load.weight) {
+		long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
 
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
-
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
+		load = min_t(long, scale_load_down(shares),
+			     gcfs_rq->runnable_load_avg *
+			     shares / gcfs_rq->load.weight);
 	}
 
 	delta = load - se->avg.load_avg;
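For reference, the new branch above can be read as "scale the child
cfs_rq's runnable_load_avg by shares/weight, capped at the group's
shares".  Here is a hedged standalone sketch of just that arithmetic
with plain longs; min_long(), toy_scale_load_down() and
propagated_load() are made-up names and the numbers are illustrative
only, while the real kernel code uses calc_cfs_shares(),
scale_load_down() and min_t() as in the patch.

	/*
	 * Standalone sketch of the propagation formula from the patch:
	 *   load = min(scale_load_down(shares),
	 *              runnable_load_avg * shares / weight)
	 */
	#include <stdio.h>

	static long min_long(long a, long b)
	{
		return a < b ? a : b;
	}

	/* Pretend scale_load_down() is a no-op (as on 32-bit kernels). */
	static long toy_scale_load_down(long w)
	{
		return w;
	}

	static long propagated_load(long shares, long runnable_load_avg,
				    long weight)
	{
		if (!weight)
			return 0;	/* mirrors the gcfs_rq->load.weight check */

		return min_long(toy_scale_load_down(shares),
				runnable_load_avg * shares / weight);
	}

	int main(void)
	{
		/*
		 * Example: child cfs_rq with weight 2048, runnable_load_avg
		 * 1500 and freshly computed shares of 1024.  The propagated
		 * load is 1500 * 1024 / 2048 = 750, under the 1024 cap.
		 */
		printf("%ld\n", propagated_load(1024, 1500, 2048));
		return 0;
	}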