Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide

Atish Patra Wed, 20 Sep 2017 12:37:52 -0700

On 08/11/2017 04:45 AM, Brendan Jackman wrote:

This patch adds a parameter to select_task_rq, sibling_count_hint
allowing the caller, where it has this information, to inform the
sched_class the number of tasks that are being woken up as part of
the same event.


The wake_q mechanism is one case where this information is available.

select_task_rq_fair can then use the information to detect that it
needs to widen the search space for task placement in order to avoid
overloading the last-level cache domain's CPUs.

                               * * *

The reason I am investigating this change is the following use case
on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
all repeatedly do X amount of work then
pthread_barrier_wait (i.e. sleep until the last task finishes its X
and hits the barrier).


Are all these tasks are homogeneous i.e. does exactly equal amount of work?

 On big.LITTLE, the tasks which get a "big" CPU

finish faster, and then those CPUs pull over the tasks that are still
running:

     v CPU v           ->time->

                    -------------
   0  (big)         11111  /333
                    -------------
   1  (big)         22222   /444|
                    -------------
   2  (LITTLE)      333333/
                    -------------
   3  (LITTLE)      444444/
                    -------------

Now when task 4 hits the barrier (at |) and wakes the others up,
there are 4 tasks with prev_cpu=<big> and 0 tasks with
prev_cpu=<little>. want_affine therefore means that we'll only look
in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
on the bigs until the next load balance, something like this:

     v CPU v           ->time->

                    ------------------------
   0  (big)         11111  /333  31313\33333
                    ------------------------
   1  (big)         22222   /444|424\4444444
                    ------------------------
   2  (LITTLE)      333333/          \222222
                    ------------------------
   3  (LITTLE)      444444/            \1111
                    ------------------------
                                 ^^^
                           underutilization

So, I'm trying to get want_affine = 0 for these tasks.

I don't _think_ any incarnation of the wakee_flips mechanism can help
us here because which task is waker and which tasks are wakees
generally changes with each iteration.

However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
nice property that we know exactly how many tasks are being woken, so
we can cheat.

It might be a disadvantage that we "widen" _every_ task that's woken in
an event, while select_idle_sibling would work fine for the first
sd_llc_size - 1 tasks.

IIUC, if wake_affine() behaves correctly this trick wouldn't be
necessary on SMP systems, so it might be best guarded by the presence
of SD_ASYM_CPUCAPACITY?

                               * * *

Final note..

In order to observe "perfect" behaviour for this use case, I also had
to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
above we are working through the work queue and have placed tasks 3
and 2, and are about to place task 1:

     v CPU v           ->time->

                    --------------
   0  (big)         11111  /333  3
                    --------------
   1  (big)         22222   /444|4
                    --------------
   2  (LITTLE)      333333/      2
                    --------------
   3  (LITTLE)      444444/          <- Task 1 should go here
                    --------------

If TTWU_QUEUE is enabled, we will not yet have enqueued task
2 (having instead sent a reschedule IPI) or attached its load to CPU
2. So we are likely to also place task 1 on cpu 2. Disabling
TTWU_QUEUE means that we enqueue task 2 before placing task 1,
solving this issue. TTWU_QUEUE is there to minimise rq lock
contention, and I guess that this contention is less of an issue on
big.LITTLE systems since they have relatively few CPUs, which
suggests the trade-off makes sense here.

Signed-off-by: Brendan Jackman <brendan.jack...@arm.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Josef Bacik <jo...@toxicpanda.com>
Cc: Joel Fernandes <joe...@google.com>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Matt Fleming <m...@codeblueprint.co.uk>
---
 include/linux/sched/wake_q.h |  2 ++
 kernel/sched/core.c          | 34 +++++++++++++++++++++++-----------
 kernel/sched/deadline.c      |  3 ++-
 kernel/sched/fair.c          | 17 +++++++++++------
 kernel/sched/idle_task.c     |  3 ++-
 kernel/sched/rt.c            |  3 ++-
 kernel/sched/sched.h         |  3 ++-
 kernel/sched/stop_task.c     |  3 ++-
 8 files changed, 46 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index d03d8a9047dc..607a888eb35b 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -33,6 +33,7 @@
 struct wake_q_head {
        struct wake_q_node *first;
        struct wake_q_node **lastp;
+       int count;
 };

Instead of passing around the head count, can we store the count intask_struct ? The patch would be lot less invasive for a single use-case.

 #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
@@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
 {
        head->first = WAKE_Q_TAIL;
        head->lastp = &head->first;
+       head->count = 0;
 }

 extern void wake_q_add(struct wake_q_head *head,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0869b20fba81..ddf9257b1467 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct 
task_struct *task)
        if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
                return;

+       head->count++;
+
        get_task_struct(task);

        /*
@@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct 
task_struct *task)
        head->lastp = &node->next;
 }

+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
+              int sibling_count_hint);
+
 void wake_up_q(struct wake_q_head *head)
 {
        struct wake_q_node *node = head->first;
@@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
                task->wake_q.next = NULL;

                /*
-                * wake_up_process() implies a wmb() to pair with the queueing
+                * try_to_wake_up() implies a wmb() to pair with the queueing
                 * in wake_q_add() so as not to miss wakeups.
                 */
-               wake_up_process(task);
+               try_to_wake_up(task, TASK_NORMAL, 0, head->count);
                put_task_struct(task);
        }
 }
@@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct 
task_struct *p)
  * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
  */
 static inline
-int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int 
wake_flags)
+int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int 
wake_flags,
+                  int sibling_count_hint)
 {
        lockdep_assert_held(&p->pi_lock);

        if (p->nr_cpus_allowed > 1)
-               cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, 
wake_flags);
+               cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, 
wake_flags,
+                                                    sibling_count_hint);
        else
                cpu = cpumask_any(&p->cpus_allowed);

@@ -1944,6 +1952,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, 
int wake_flags)
  * @p: the thread to be awakened
  * @state: the mask of task states that can be woken
  * @wake_flags: wake modifier flags (WF_*)
+ * @sibling_count_hint: A hint at the number of threads that are being woken up
+ *                      in this event.


I also had the same thought as Joel about the naming.
Probably wakee_count ?

  *
  * If (@state & @p->state) @p->state = TASK_RUNNING.
  *
@@ -1956,7 +1966,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, 
int wake_flags)
  *        %false otherwise.
  */
 static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
+              int sibling_count_hint)
 {
        unsigned long flags;
        int cpu, success = 0;
@@ -2042,7 +2053,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, 
int wake_flags)
                atomic_dec(&task_rq(p)->nr_iowait);
        }

-       cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
+       cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags,
+                            sibling_count_hint);
        if (task_cpu(p) != cpu) {
                wake_flags |= WF_MIGRATED;
                set_task_cpu(p, cpu);
@@ -2130,13 +2142,13 @@ static void try_to_wake_up_local(struct task_struct *p, 
struct rq_flags *rf)
  */
 int wake_up_process(struct task_struct *p)
 {
-       return try_to_wake_up(p, TASK_NORMAL, 0);
+       return try_to_wake_up(p, TASK_NORMAL, 0, 1);
 }
 EXPORT_SYMBOL(wake_up_process);

 int wake_up_state(struct task_struct *p, unsigned int state)
 {
-       return try_to_wake_up(p, state, 0);
+       return try_to_wake_up(p, state, 0, 1);
 }

 /*
@@ -2442,7 +2454,7 @@ void wake_up_new_task(struct task_struct *p)
         * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
         * as we're not fully set-up yet.
         */
-       __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+       __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 
1));
 #endif
        rq = __task_rq_lock(p, &rf);
        update_rq_clock(rq);
@@ -2893,7 +2905,7 @@ void sched_exec(void)
        int dest_cpu;

        raw_spin_lock_irqsave(&p->pi_lock, flags);
-       dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), 
SD_BALANCE_EXEC, 0);
+       dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), 
SD_BALANCE_EXEC, 0, 1);
        if (dest_cpu == smp_processor_id())
                goto unlock;

@@ -3582,7 +3594,7 @@ asmlinkage __visible void __sched 
preempt_schedule_irq(void)
 int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int 
wake_flags,
                          void *key)
 {
-       return try_to_wake_up(curr->private, mode, wake_flags);
+       return try_to_wake_up(curr->private, mode, wake_flags, 1);
 }
 EXPORT_SYMBOL(default_wake_function);

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 755bd3f1a1a9..69a9dd407267 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1516,7 +1516,8 @@ static void yield_task_dl(struct rq *rq)
 static int find_later_rq(struct task_struct *task);

 static int
-select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags,
+                 int sibling_count_hint)
 {
        struct task_struct *curr;
        struct rq *rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e216f6..0a9d706b62bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5332,15 +5332,18 @@ static void record_wakee(struct task_struct *p)
  * whatever is irrelevant, spread criteria is apparent partner count exceeds
  * socket size.
  */
-static int wake_wide(struct task_struct *p)
+static int wake_wide(struct task_struct *p, int sibling_count_hint)
 {
        unsigned int master = current->wakee_flips;
        unsigned int slave = p->wakee_flips;
-       int factor = this_cpu_read(sd_llc_size);
+       int llc_size = this_cpu_read(sd_llc_size);
+
+       if (sibling_count_hint >= llc_size)
+               return 1;

Since we are talking about 1 task per cpu, should it besibling_count_hint > llc_size ?

        if (master < slave)
                swap(master, slave);
-       if (slave < factor || master < slave * factor)
+       if (slave < llc_size || master < slave * llc_size)
                return 0;
        return 1;
 }
@@ -5869,7 +5872,8 @@ static int wake_cap(struct task_struct *p, int cpu, int 
prev_cpu)
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int 
wake_flags)
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int 
wake_flags,
+                   int sibling_count_hint)
 {
        struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
        int cpu = smp_processor_id();
@@ -5879,8 +5883,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int sd_flag, int wake_f

        if (sd_flag & SD_BALANCE_WAKE) {
                record_wakee(p);
-               want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
-                             && cpumask_test_cpu(cpu, &p->cpus_allowed);
+               want_affine = !wake_wide(p, sibling_count_hint) &&
+                             !wake_cap(p, cpu, prev_cpu) &&
+                             cpumask_test_cpu(cpu, &p->cpus_allowed);
        }

        rcu_read_lock();
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 0c00172db63e..3c343e055110 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -9,7 +9,8 @@

 #ifdef CONFIG_SMP
 static int
-select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags,
+                   int sibling_count_hint)
 {
        return task_cpu(p); /* IDLE tasks as never migrated */
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 45caf937ef90..b9937dccb8b3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1387,7 +1387,8 @@ static void yield_task_rt(struct rq *rq)
 static int find_lowest_rq(struct task_struct *task);

 static int
-select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
+                 int sibling_count_hint)
 {
        struct task_struct *curr;
        struct rq *rq;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eeef1a3086d1..56ae525618e9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1419,7 +1419,8 @@ struct sched_class {
        void (*put_prev_task) (struct rq *rq, struct task_struct *p);

 #ifdef CONFIG_SMP
-       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int 
sd_flag, int flags);
+       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int 
sd_flag, int flags,
+                              int subling_count_hint);
        void (*migrate_task_rq)(struct task_struct *p);

        void (*task_woken) (struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 9f69fb630853..d0ce4fbb18ef 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -11,7 +11,8 @@

 #ifdef CONFIG_SMP
 static int
-select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags,
+                   int sibling_count_hint)
 {
        return task_cpu(p); /* stop tasks as never migrate */
 }

Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide

Reply via email to