On Wed, Apr 25, 2018 at 05:36:00PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 24, 2018 at 05:10:34PM -0700, Subhra Mazumdar wrote:
> > On 04/24/2018 05:53 AM, Peter Zijlstra wrote:
> 
> > > Why do you need to put a max on? Why isn't the proportional thing
> > > working as is? (is the average no good because of big variance or what)
> 
> > Firstly the choosing of 512 seems arbitrary.
> 
> It is; it is a crude attempt to deal with big variance. The comment says
> as much.
> 
> > Secondly, the logic here is that the enqueuing cpu should search for up
> > to the time it could be doing work itself.  Why is that the optimal
> > amount of time to search?
> 
> 1/512-th of the time in fact, per the above random number, but yes.
> Because searching for longer than we expect to be idle is clearly bad;
> at that point we're inhibiting useful work.
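
(For reference, the proportional thing in question is the SIS_PROP branch a
little further down in select_idle_cpu(); from memory it looks roughly like
the below, turning the scaled idle estimate into a cap on how many CPUs the
scan loop may visit:

	if (sched_feat(SIS_PROP)) {
		u64 span_avg = sd->span_weight * avg_idle;
		if (span_avg > 4*avg_cost)
			nr = div_u64(span_avg, avg_cost);
		else
			nr = 4;
	}

so the "1/512-th" above is avg_idle/512 going into that ratio.)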
> 
> But while thinking about all this, I think I've spotted a few more
> issues, aside from the variance:
> 
> Firstly, while avg_idle estimates the average idle duration for _when_ we
> do go idle, it doesn't tell us anything useful when we do not in fact go
> idle. So for wakeups that happen while we're fully busy, avg_idle is a
> poor measure.
> 
> Secondly, the number of wakeups performed is also important. If we have
> a lot of wakeups, we need to look at aggregate wakeup time over a
> period. Not just single wakeup time.
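
(Back-of-the-envelope: at 100k wakeups/sec, even 2us of scanning per wakeup
adds up to 200ms of scan time per second, a fifth of a CPU, which no
per-wakeup number will ever make visible.)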
> 
> And thirdly, we're sharing the idle duration with newidle balance.
> 
> And I think the 512 is a result of me not having recognised these
> additional issues when looking at the traces; I saw variance and left it
> at that.
> 
> 
> This leaves me thinking we need a better estimator for wakeups. Because
> if there really is significant idle time, not looking for idle CPUs to
> run on is bad. Placing that upper limit, especially such a low one, is
> just an indication of failure.
> 
> 
> I'll see if I can come up with something.

Something like so _could_ work. Again, completely untested. We feed idle
time into wake_avg and subtract the select_idle_sibling() 'runtime' from
wake_avg, such that when there are lots of wakeups we don't spend more time
scanning than the reported idle time. And we age wake_avg, such that if
there hasn't been any idle time for a number of ticks (we've been real
busy) we also stop scanning wide.
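
Concretely: when a wakeup finds the rq coming out of idle, wake_avg is
refilled to half of avg_idle; every select_idle_cpu() scan then subtracts
its measured scan time from that budget, and every jiffy that goes by
without going idle again halves whatever is left, so a couple of busy
ticks quickly shrink how wide we're willing to scan.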

But it could eat your granny and set your cat on fire :-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10aaeebfcc..bc910e5776cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1671,6 +1671,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
                if (rq->avg_idle > max)
                        rq->avg_idle = max;
 
+               rq->wake_stamp = jiffies;
+               rq->wake_avg = rq->avg_idle / 2;
+
                rq->idle_stamp = 0;
        }
 #endif
@@ -6072,6 +6075,8 @@ void __init sched_init(void)
                rq->online = 0;
                rq->idle_stamp = 0;
                rq->avg_idle = 2*sysctl_sched_migration_cost;
+               rq->wake_stamp = jiffies;
+               rq->wake_avg = rq->avg_idle;
                rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
                INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54dc31e7ab9b..fee31dbe15ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6369,7 +6369,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
        struct sched_domain *this_sd;
+       unsigned long now = jiffies;
        u64 avg_cost, avg_idle;
+       struct rq *this_rq;
        u64 time, cost;
        s64 delta;
        int cpu, nr = INT_MAX;
@@ -6378,11 +6380,20 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
        if (!this_sd)
                return -1;
 
-       /*
-        * Due to large variance we need a large fuzz factor; hackbench in
-        * particularly is sensitive here.
-        */
-       avg_idle = this_rq()->avg_idle / 512;
+       this_rq = this_rq();
+       if (sched_feat(SIS_NEW)) {
+               /* age the remaining idle time: one halving per busy tick */
+               if (unlikely(this_rq->wake_stamp < now)) {
+                       while (this_rq->wake_stamp < now && this_rq->wake_avg) {
+                               this_rq->wake_stamp++;
+                               this_rq->wake_avg >>= 1;
+                       }
+               }
+               avg_idle = this_rq->wake_avg;
+       } else {
+               avg_idle = this_rq->avg_idle / 512;
+       }
+
        avg_cost = this_sd->avg_scan_cost + 1;
 
        if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
@@ -6412,6 +6423,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
        delta = (s64)(time - cost) / 8;
        this_sd->avg_scan_cost += delta;
 
+       /* you can only spend the time once */
+       if (this_rq->wake_avg > time)
+               this_rq->wake_avg -= time;
+       else
+               this_rq->wake_avg = 0;
+
        return cpu;
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae8488039c..f5f86a59aac4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_NEW, false)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c222ca2..c4d6ddf907b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -831,6 +831,9 @@ struct rq {
        u64                     idle_stamp;
        u64                     avg_idle;
 
+       unsigned long           wake_stamp;
+       u64                     wake_avg;
+
        /* This is used to determine avg_idle's max value */
        u64                     max_idle_balance_cost;
 #endif
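
FWIW, the feature defaults to off; with SCHED_DEBUG it can be flipped at
runtime to compare, e.g.:

	echo SIS_NEW    > /sys/kernel/debug/sched_features
	echo NO_SIS_NEW > /sys/kernel/debug/sched_features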
