On 1/29/21 9:27 AM, Vincent Guittot wrote:
> 
> The patch below moves the update of the blocked load of CPUs outside 
> newidle_balance().

On a well known database workload, we also saw a lot of overhead to do 
update_blocked_averages
in newidle_balance().  So changes to reduce this overhead is much welcomed.

Turning on cgroup induces 9% throughput degradation on a 2 socket 40 cores per 
socket Icelake system.  

A big part of the overhead in our database workload comes from updating
blocked averages in newidle_balance, caused by I/O threads making
some CPUs go in and out of idle frequently in the following code path:

----__blkdev_direct_IO_simple
          |          
          |----io_schedule_timeout
          |          |          
          |           ----schedule_timeout
          |                     |          
          |                      ----schedule
          |                                |          
          |                                 ----__schedule
          |                                           |          
          |                                            ----pick_next_task_fair
          |                                                      |          
          |                                                       
----newidle_balance
          |                                                                 |   
       
                                                                             
----update_blocked_averages

We found update_blocked_averages() now consumed most CPU time, eating up 2% of 
the CPU cycles once cgroup
gets turned on.

I hacked up Joe's original patch to rate limit the update of blocked
averages called from newidle_balance().  The 9% throughput degradation reduced 
to
5.4%.  We'll be testing Vincent's change to see if it can give
similar performance improvement.

Though in our test environment, sysctl_sched_migration_cost was kept
much lower (25000) compared to the default (500000), to encourage migrations to 
idle cpu
and reduce latency.  We got quite a lot of calls to update_blocked_averages 
directly 
and then try to load_balance in newidle_balance instead of relegating
the responsibility to idle load balancer.  (See code snippet in newidle_balance 
below)  


        if (this_rq->avg_idle < sysctl_sched_migration_cost ||       
<-----sched_migration_cost check
            !READ_ONCE(this_rq->rd->overload)) {

                rcu_read_lock();
                sd = rcu_dereference_check_sched_domain(this_rq->sd);
                if (sd)
                        update_next_balance(sd, &next_balance);
                rcu_read_unlock();

                goto out;  <--- invoke idle load balancer
        }

        raw_spin_unlock(&this_rq->lock);

        update_blocked_averages(this_cpu);

        .... followed by load balance code ---


So the update_blocked_averages offload to idle_load_balancer in Vincent's patch 
is less 
effective in this case with small sched_migration_cost.

Looking at the code a bit more, we don't actually load balance every time in 
this code path
unless our avg_idle time exceeds some threshold.  Doing update_blocked_averages 
immediately 
is only needed if we do call load_balance().  If we don't do any load balance 
in the code path,
we can let the idle load balancer update the blocked averages lazily.

Something like the following perhaps on top of Vincent's patch?  We haven't 
really tested
this change yet but want to see if this change makes sense to you.

Tim

Signed-off-by: Tim Chen <tim.c.c...@linux.intel.com>
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63950d80fd0b..b93f5f52658a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10591,6 +10591,7 @@ static int newidle_balance(struct rq *this_rq, struct 
rq_flags *rf)
        struct sched_domain *sd;
        int pulled_task = 0;
        u64 curr_cost = 0;
+       bool updated_blocked_avg = false;
 
        update_misfit_status(NULL, this_rq);
        /*
@@ -10627,7 +10628,6 @@ static int newidle_balance(struct rq *this_rq, struct 
rq_flags *rf)
 
        raw_spin_unlock(&this_rq->lock);
 
-       update_blocked_averages(this_cpu);
        rcu_read_lock();
        for_each_domain(this_cpu, sd) {
                int continue_balancing = 1;
@@ -10639,6 +10639,11 @@ static int newidle_balance(struct rq *this_rq, struct 
rq_flags *rf)
                }
 
                if (sd->flags & SD_BALANCE_NEWIDLE) {
+                       if (!updated_blocked_avg) {
+                               update_blocked_averages(this_cpu);
+                               updated_blocked_avg = true;
+                       }
+
                        t0 = sched_clock_cpu(this_cpu);
 
                        pulled_task = load_balance(this_cpu, this_rq,
 

Reply via email to