Hi Tim,

On Tuesday, March 23, 2021 at 14:37:59 (-0700), Tim Chen wrote:
> 
> 
> On 1/29/21 9:27 AM, Vincent Guittot wrote:
> > 
> > The patch below moves the update of the blocked load of CPUs outside
> > newidle_balance().
> 
> On a well known database workload, we also saw a lot of overhead from doing
> update_blocked_averages in newidle_balance(), so changes that reduce this
> overhead are much welcomed.
> 
> Turning on cgroup induces a 9% throughput degradation on a 2-socket,
> 40-cores-per-socket Icelake system.
> 
> A big part of the overhead in our database workload comes from updating
> blocked averages in newidle_balance, caused by I/O threads making some
> CPUs go in and out of idle frequently in the following code path:
> 
>     __blkdev_direct_IO_simple
>       io_schedule_timeout
>         schedule_timeout
>           schedule
>             __schedule
>               pick_next_task_fair
>                 newidle_balance
>                   update_blocked_averages
> 
> We found that update_blocked_averages() now consumed the most CPU time,
> eating up 2% of the CPU cycles once cgroup gets turned on.
> 
> I hacked up Joe's original patch to rate limit the update of blocked
> averages called from newidle_balance(). The 9% throughput degradation was
> reduced to 5.4%. We'll be testing Vincent's change to see if it can give a
> similar performance improvement.
> 
> Though in our test environment, sysctl_sched_migration_cost was kept much
> lower (25000) than the default (500000), to encourage migrations to idle
> CPUs and reduce latency. As a result, we got quite a lot of calls that do
> update_blocked_averages directly and then try to load balance in
> newidle_balance, instead of relegating the responsibility to the idle load
> balancer. (See the code snippet in newidle_balance below.)
> 
> 
>         if (this_rq->avg_idle < sysctl_sched_migration_cost ||         <----- sched_migration_cost check
>             !READ_ONCE(this_rq->rd->overload)) {
> 
>                 rcu_read_lock();
>                 sd = rcu_dereference_check_sched_domain(this_rq->sd);
>                 if (sd)
>                         update_next_balance(sd, &next_balance);
>                 rcu_read_unlock();
> 
>                 goto out;               <--- invoke idle load balancer
>         }
> 
>         raw_spin_unlock(&this_rq->lock);
> 
>         update_blocked_averages(this_cpu);
> 
>         .... followed by load balance code ....
> 
> So the update_blocked_averages offload to the idle load balancer in
> Vincent's patch is less effective in this case with a small
> sched_migration_cost.
> 
> Looking at the code a bit more, we don't actually load balance every time
> in this code path unless our avg_idle time exceeds some threshold. Doing
> update_blocked_averages immediately
IIUC your problem, we call update_blocked_averages() but, because of:

        if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
                update_next_balance(sd, &next_balance);
                break;
        }

the for_each_domain() loop stops even before running load_balance() on the
1st sched domain level, which means that update_blocked_averages() was
called unnecessarily. And this is even more true with a small
sysctl_sched_migration_cost, which allows newly idle LB for a very small
this_rq->avg_idle.

We could wonder why you set such a low value for sysctl_sched_migration_cost,
which is lower than the max_newidle_lb_cost of the smallest domain, but
that's probably because of task_hot().

If avg_idle is lower than the sd->max_newidle_lb_cost of the 1st
sched_domain, we should skip the spin_unlock/lock and the for_each_domain()
loop entirely.

Maybe something like below:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76e33a70d575..08933e0d87ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10723,17 +10723,21 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
          */
         rq_unpin_lock(this_rq, rf);
 
+        rcu_read_lock();
+        sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
         if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-            !READ_ONCE(this_rq->rd->overload)) {
+            !READ_ONCE(this_rq->rd->overload) ||
+            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-                rcu_read_lock();
-                sd = rcu_dereference_check_sched_domain(this_rq->sd);
                 if (sd)
                         update_next_balance(sd, &next_balance);
                 rcu_read_unlock();
 
                 goto out;
         }
+        rcu_read_unlock();
+
         raw_spin_unlock(&this_rq->lock);

> is only needed if we do call load_balance(). If we don't do any load
> balance in the code path, we can let the idle load balancer update the
> blocked averages lazily.
> 
> Something like the following perhaps, on top of Vincent's patch? We haven't
> really tested this change yet but want to see if it makes sense to you.
> 
> Tim
> 
> Signed-off-by: Tim Chen <tim.c.c...@linux.intel.com>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 63950d80fd0b..b93f5f52658a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10591,6 +10591,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>          struct sched_domain *sd;
>          int pulled_task = 0;
>          u64 curr_cost = 0;
> +        bool updated_blocked_avg = false;
>  
>          update_misfit_status(NULL, this_rq);
>          /*
> @@ -10627,7 +10628,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>          raw_spin_unlock(&this_rq->lock);
>  
> -        update_blocked_averages(this_cpu);
>          rcu_read_lock();
>          for_each_domain(this_cpu, sd) {
>                  int continue_balancing = 1;
> @@ -10639,6 +10639,11 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>                  }
>  
>                  if (sd->flags & SD_BALANCE_NEWIDLE) {
> +                        if (!updated_blocked_avg) {
> +                                update_blocked_averages(this_cpu);
> +                                updated_blocked_avg = true;
> +                        }
> +
>                          t0 = sched_clock_cpu(this_cpu);
>  
>                          pulled_task = load_balance(this_cpu, this_rq,
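Coming back to the rate limiting Tim mentions above: purely as an illustration
of that kind of throttle (this is not Tim's actual hack and is not against any
particular tree; "last_nidle_blocked_update" is a made-up per-rq jiffies stamp
and HZ/100 an arbitrary interval), it could look roughly like:

/*
 * Illustrative sketch only: throttle the blocked load update done from
 * the newidle path. "last_nidle_blocked_update" is a made-up per-rq
 * jiffies timestamp and HZ/100 (~10ms) an arbitrary interval.
 */
static void newidle_update_blocked_averages(struct rq *this_rq, int this_cpu)
{
        /* Updated recently enough? Then skip the expensive walk. */
        if (time_before(jiffies,
                        this_rq->last_nidle_blocked_update + HZ / 100))
                return;

        this_rq->last_nidle_blocked_update = jiffies;
        update_blocked_averages(this_cpu);
}

With something along those lines, the newidle path would pay for a full
blocked load update at most ~100 times per second per CPU, while the idle
load balancer still refreshes the rest lazily.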