Re: [PATCH v5 04/15] sched/core: uclamp: add CPU's clamp groups accounting

Patrick Bellasi Mon, 29 Oct 2018 11:43:30 -0700

A slightly older version was posted in error along with the correct one.
Please comment on:


   Message-ID: <[email protected]>

Sorry for the noise.

On 29-Oct 18:32, Patrick Bellasi wrote:
> Utilization clamping allows clamping the utilization of a CPU within a
> [util_min, util_max] range which depends on the set of currently
> RUNNABLE tasks on that CPU.
> Each task references two "clamp groups" defining the minimum and maximum
> utilization clamp values to be considered for that task. Each clamp
> value is mapped to a clamp group, which is enforced on a CPU only when
> there is at least one RUNNABLE task referencing that clamp group.
> 
> When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups
> active on that CPU can change. Since each clamp group enforces a
> different utilization clamp value, once the set of these groups changes
> the new "aggregated" clamp value to apply on that CPU has to be
> re-computed.
> 
> Clamp values are always MAX aggregated for both util_min and util_max.
> This is to ensure that no task can affect the performance of other
> co-scheduled tasks which are either more boosted (i.e. with higher
> util_min clamp) or less capped (i.e. with higher util_max clamp).
> 
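
To make the MAX aggregation rule concrete, here is a minimal userspace
sketch (not kernel code). The struct task_clamp type and the
cpu_clamp_aggregate() helper are hypothetical names used only for
illustration; the patch itself aggregates per clamp group, not per task.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task clamp pair, for illustration only. */
struct task_clamp {
	unsigned int util_min;
	unsigned int util_max;
};

/*
 * MAX-aggregate both clamps over the RUNNABLE tasks of one CPU.
 * With no tasks, both clamps are 0, matching the max_value = 0
 * starting point used by uclamp_cpu_update() in the patch.
 */
static struct task_clamp cpu_clamp_aggregate(const struct task_clamp *tasks,
					     size_t nr_tasks)
{
	struct task_clamp cpu = { 0, 0 };
	size_t i;

	assert(tasks || !nr_tasks);

	for (i = 0; i < nr_tasks; i++) {
		if (cpu.util_min < tasks[i].util_min)
			cpu.util_min = tasks[i].util_min;
		if (cpu.util_max < tasks[i].util_max)
			cpu.util_max = tasks[i].util_max;
	}
	return cpu;
}
```

For example, a boosted task with (util_min=512, util_max=1024)
co-scheduled with a capped task with (util_min=0, util_max=256) yields a
CPU clamp of (512, 1024): the capped task restricts neither the boost
nor the headroom of the other task.
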
> Here we introduce the required support to properly reference count clamp
> groups at each task enqueue/dequeue time.
> 
> Tasks have a:
>    task_struct::uclamp::group_id[clamp_idx]
> indexing, for each clamp index (i.e. util_{min,max}), the clamp group
> they have to refcount at enqueue time.
> 
> Each CPU's rq has a:
>    rq::uclamp::group[clamp_idx][group_idx].tasks
> which is used to reference count how many tasks are currently RUNNABLE on
> that CPU for each clamp group of each clamp index.
> 
> The clamp value of each clamp group is tracked by
>    rq::uclamp::group[][].value
> thus making rq::uclamp::group[][] an unordered array of clamp values.
> However, the MAX aggregation of the currently active clamp groups is
> implemented to minimize the number of times we need to scan the complete
> (unordered) clamp group array to figure out the new max value. This
> operation indeed happens only when we dequeue the last task of the clamp
> group corresponding to the current max clamp, and thus the CPU is either
> entering IDLE or going to schedule a less boosted or more clamped task.
> Moreover, the expected number of different clamp values, which can be
> configured at build time, is usually so small that a more advanced
> ordering algorithm is not needed. In real use-cases we expect fewer than
> 10 different clamp values for each clamp index.
> 
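
The lazy rescan described above can be sketched in userspace as follows.
This is a simplified model for a single clamp index, not the kernel
implementation: NR_GROUPS, CAPACITY_SCALE, struct cpu_clamp and the
cpu_clamp_*() helpers are hypothetical stand-ins, and the scans counter
exists only to show when the full array walk happens.

```c
#include <assert.h>

#define NR_GROUPS	4	/* hypothetical; stands in for UCLAMP_GROUPS */
#define CAPACITY_SCALE	1024	/* mirrors SCHED_CAPACITY_SCALE */

struct clamp_group {
	unsigned int value;	/* clamp value of this group */
	unsigned int tasks;	/* RUNNABLE tasks refcounting it */
};

struct cpu_clamp {
	struct clamp_group group[NR_GROUPS];
	unsigned int value;	/* current MAX-aggregated clamp */
	unsigned int scans;	/* full-array scans, illustration only */
};

/* Slow path: walk the unordered array to find the new max. */
static void cpu_clamp_update(struct cpu_clamp *cc)
{
	unsigned int id, max_value = 0;

	cc->scans++;
	for (id = 0; id < NR_GROUPS; id++) {
		if (!cc->group[id].tasks)
			continue;
		if (max_value < cc->group[id].value)
			max_value = cc->group[id].value;
		if (max_value >= CAPACITY_SCALE)
			break;
	}
	cc->value = max_value;
}

/* Enqueue: take a reference; the max can only grow, so no scan. */
static void cpu_clamp_get(struct cpu_clamp *cc, unsigned int id)
{
	assert(id < NR_GROUPS);
	cc->group[id].tasks++;
	if (cc->value < cc->group[id].value)
		cc->value = cc->group[id].value;
}

/* Dequeue: rescan only when the last task of the max group leaves. */
static void cpu_clamp_put(struct cpu_clamp *cc, unsigned int id)
{
	assert(id < NR_GROUPS && cc->group[id].tasks > 0);
	if (--cc->group[id].tasks)
		return;
	if (cc->group[id].value >= cc->value)
		cpu_clamp_update(cc);
}
```

With groups valued 256 and 768 both active, releasing one of two tasks
in the 768 group costs no scan; releasing the second one triggers a
single scan that lowers the CPU clamp to 256.
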
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> 
> ---
> Changes in v5:
>  Message-ID: <20180914134128.GP1413@e110439-lin>
>  - remove not required check for (group_id == UCLAMP_NOT_VALID)
>    in uclamp_cpu_put_id
>  Message-ID: <20180912174456.GJ1413@e110439-lin>
>  - use bitfields to compress uclamp_group
>  Others:
>  - consistently use "unsigned int" for both clamp_id and group_id
>  - fixup documentation
>  - reduced usage of inline comments
>  - rebased on v4.19.0
> 
> Changes in v4:
>  Message-ID: <20180816133249.GA2964@e110439-lin>
>  - keep the WARN in uclamp_cpu_put_id() but tidy up that code a bit
>  - add another WARN on the unexpected condition of releasing a refcount
>    from a CPU which has a lower clamp value active
>  Other:
>  - ensure (and check) that all tasks have a valid group_id at
>    uclamp_cpu_get_id()
>  - rework uclamp_cpu layout to better fit into just 2x64B cache lines
>  - fix some s/SCHED_DEBUG/CONFIG_SCHED_DEBUG/
>  - rebased on v4.19-rc1
> Changes in v3:
>  Message-ID: <CAJuCfpF6=L=0lrmnnjrtnpazt4dwkqnv+thhn0dwpkcguzs...@mail.gmail.com>
>  - add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
>  - rename UCLAMP_NONE into UCLAMP_NOT_VALID
>  Message-ID: <CAJuCfpGaKvxKcO=RLcmveHRB9qbMrvFs2yFVrk=k-v_m7jk...@mail.gmail.com>
>  - few typos fixed
>  Other:
>  - rebased on tip/sched/core
> Changes in v2:
>  Message-ID: <[email protected]>
>  - refactored struct rq::uclamp_cpu to be more cache efficient
>    no more holes, re-arranged vectors to match cache lines with expected
>    data locality
>  Message-ID: <[email protected]>
>  - use *rq as parameter whenever already available
>  - add scheduling class's uclamp_enabled marker
>  - get rid of the "confusing" single callback uclamp_task_update()
>    and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
>  - fix/remove "bad" comments
>  Message-ID: <20180413113337.GU14248@e110439-lin>
>  - remove inline from init_uclamp, flag it __init
>  Other:
>  - rebased on v4.18-rc4
>  - improved documentation to make more explicit some concepts.
> ---
>  include/linux/sched.h |   5 ++
>  kernel/sched/core.c   | 185 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h  |  49 +++++++++++
>  3 files changed, 239 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index facace271ea1..3ab1cbd4e3b1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -604,11 +604,16 @@ struct sched_dl_entity {
>   * The mapped bit is set whenever a task has been mapped on a clamp group for
>   * the first time. When this bit is set, any clamp group get (for a new clamp
>   * value) will be matches by a clamp group put (for the old clamp value).
> + *
> + * The active bit is set whenever a task has got an effective clamp group
> + * and value assigned, which can be different from the user requested ones.
> + * This tells whether a task is actually refcounting a CPU's clamp group.
>   */
>  struct uclamp_se {
>       unsigned int value              : SCHED_CAPACITY_SHIFT + 1;
>       unsigned int group_id           : order_base_2(UCLAMP_GROUPS);
>       unsigned int mapped             : 1;
> +     unsigned int active             : 1;
>  };
>  #endif /* CONFIG_UCLAMP_TASK */
>  
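
A note on the bitfield widths used by struct uclamp_se: the value field
is SCHED_CAPACITY_SHIFT + 1 bits wide because clamp values span the
closed range [0..SCHED_CAPACITY_SCALE]. The userspace sketch below
assumes SCHED_CAPACITY_SHIFT is 10 (so the scale is 1024); the struct
names and the order_base_2_sketch() helper are hypothetical and only
approximate the kernel's order_base_2() for small inputs.

```c
#include <assert.h>

#define CAPACITY_SHIFT	10			/* assumed SCHED_CAPACITY_SHIFT */
#define CAPACITY_SCALE	(1U << CAPACITY_SHIFT)	/* 1024 */

struct clamp_too_narrow {
	unsigned int value : CAPACITY_SHIFT;	 /* holds 0..1023 only */
};

struct clamp_ok {
	unsigned int value : CAPACITY_SHIFT + 1; /* holds 0..2047 */
};

/* Storing into an unsigned bitfield wraps modulo 2^width. */
static unsigned int store_narrow(unsigned int v)
{
	struct clamp_too_narrow c;

	c.value = v;
	return c.value;
}

static unsigned int store_ok(unsigned int v)
{
	struct clamp_ok c;

	c.value = v;
	return c.value;
}

/* Smallest number of bits able to index n distinct group ids. */
static unsigned int order_base_2_sketch(unsigned int n)
{
	unsigned int bits = 0;

	assert(n > 0);
	while ((1U << bits) < n)
		bits++;
	return bits;
}
```

With only CAPACITY_SHIFT bits, storing CAPACITY_SCALE would silently
wrap to 0, turning the "no clamp" maximum into the strongest possible
cap; the extra bit avoids that. Likewise, 3 bits are enough to index
anywhere from 5 to 8 clamp groups.
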
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 654327d7f212..a98a96a7d9f1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -783,6 +783,159 @@ union uclamp_map {
>   */
>  static union uclamp_map uclamp_maps[UCLAMP_CNT][UCLAMP_GROUPS];
>  
> +/**
> + * uclamp_cpu_update: updates the utilization clamp of a CPU
> + * @rq: the CPU's rq whose utilization clamp has to be updated
> + * @clamp_id: the clamp index to update
> + *
> + * When tasks are enqueued/dequeued on/from a CPU, the set of currently active
> + * clamp groups can change. Since each clamp group enforces a different
> + * utilization clamp value, once the set of active groups changes it can be
> + * required to re-compute what is the new clamp value to apply for that CPU.
> + *
> + * For the specified clamp index, this method computes the new CPU utilization
> + * clamp to use until the next change on the set of active clamp groups.
> + */
> +static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
> +{
> +     unsigned int group_id;
> +     int max_value = 0;
> +
> +     for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
> +             if (!rq->uclamp.group[clamp_id][group_id].tasks)
> +                     continue;
> +             /* Both min and max clamps are MAX aggregated */
> +             if (max_value < rq->uclamp.group[clamp_id][group_id].value)
> +                     max_value = rq->uclamp.group[clamp_id][group_id].value;
> +             if (max_value >= SCHED_CAPACITY_SCALE)
> +                     break;
> +     }
> +     rq->uclamp.value[clamp_id] = max_value;
> +}
> +
> +/**
> + * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
> + * @p: the task being enqueued on a CPU
> + * @rq: the CPU's rq where the clamp group has to be reference counted
> + * @clamp_id: the clamp index to update
> + *
> + * Once a task is enqueued on a CPU's rq, the clamp group currently defined by
> + * the task's uclamp::group_id is reference counted on that CPU.
> + */
> +static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
> +                                  unsigned int clamp_id)
> +{
> +     unsigned int group_id;
> +
> +     if (unlikely(!p->uclamp[clamp_id].mapped))
> +             return;
> +
> +     group_id = p->uclamp[clamp_id].group_id;
> +     p->uclamp[clamp_id].active = true;
> +
> +     rq->uclamp.group[clamp_id][group_id].tasks += 1;
> +
> +     if (rq->uclamp.value[clamp_id] < p->uclamp[clamp_id].value)
> +             rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
> +}
> +
> +/**
> + * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
> + * @p: the task being dequeued from a CPU
> + * @rq: the CPU's rq from where the clamp group has to be released
> + * @clamp_id: the clamp index to update
> + *
> + * When a task is dequeued from a CPU's rq, the CPU's clamp group reference
> + * counted by the task is released.
> + * If this was the last task reference counting the current max clamp group,
> + * then the CPU clamping is updated to find the new max for the specified
> + * clamp index.
> + */
> +static inline void uclamp_cpu_put_id(struct task_struct *p, struct rq *rq,
> +                                  unsigned int clamp_id)
> +{
> +     unsigned int clamp_value;
> +     unsigned int group_id;
> +
> +     if (unlikely(!p->uclamp[clamp_id].mapped))
> +             return;
> +
> +     group_id = p->uclamp[clamp_id].group_id;
> +     p->uclamp[clamp_id].active = false;
> +
> +     if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
> +             rq->uclamp.group[clamp_id][group_id].tasks -= 1;
> +#ifdef CONFIG_SCHED_DEBUG
> +     else {
> +             WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
> +                  cpu_of(rq), clamp_id, group_id);
> +     }
> +#endif
> +
> +     if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
> +             return;
> +
> +     clamp_value = rq->uclamp.group[clamp_id][group_id].value;
> +#ifdef CONFIG_SCHED_DEBUG
> +     if (unlikely(clamp_value > rq->uclamp.value[clamp_id])) {
> +             WARN(1, "invalid CPU[%d] clamp group [%u:%u] value\n",
> +                  cpu_of(rq), clamp_id, group_id);
> +     }
> +#endif
> +     if (clamp_value >= rq->uclamp.value[clamp_id])
> +             uclamp_cpu_update(rq, clamp_id);
> +}
> +
> +/**
> + * uclamp_cpu_get(): increase CPU's clamp group refcount
> + * @rq: the CPU's rq where the task is enqueued
> + * @p: the task being enqueued
> + *
> + * When a task is enqueued on a CPU's rq, all the clamp groups currently
> + * enforced on a task are reference counted on that rq. Since not all
> + * scheduling classes have utilization clamping support, their tasks will
> + * be silently ignored.
> + *
> + * This method updates the utilization clamp constraints considering the
> + * requirements for the specified task. Thus, this update must be done before
> + * calling into the scheduling classes, which will eventually update schedutil
> + * considering the new task requirements.
> + */
> +static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p)
> +{
> +     unsigned int clamp_id;
> +
> +     if (unlikely(!p->sched_class->uclamp_enabled))
> +             return;
> +
> +     for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> +             uclamp_cpu_get_id(p, rq, clamp_id);
> +}
> +
> +/**
> + * uclamp_cpu_put(): decrease CPU's clamp group refcount
> + * @rq: the CPU's rq from where the task is dequeued
> + * @p: the task being dequeued
> + *
> + * When a task is dequeued from a CPU's rq, all the clamp groups the task has
> + * reference counted at enqueue time are now released.
> + *
> + * This method updates the utilization clamp constraints considering the
> + * requirements for the specified task. Thus, this update must be done before
> + * calling into the scheduling classes, which will eventually update schedutil
> + * considering the new task requirements.
> + */
> +static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
> +{
> +     unsigned int clamp_id;
> +
> +     if (unlikely(!p->sched_class->uclamp_enabled))
> +             return;
> +
> +     for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> +             uclamp_cpu_put_id(p, rq, clamp_id);
> +}
> +
>  /**
>   * uclamp_group_put: decrease the reference count for a clamp group
>   * @clamp_id: the clamp index which was affected by a task group
> @@ -836,6 +989,7 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
>       unsigned int free_group_id;
>       unsigned int group_id;
>       unsigned long res;
> +     int cpu;
>  
>  retry:
>  
> @@ -866,6 +1020,28 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
>       if (res != uc_map_old.data)
>               goto retry;
>  
> +     /* Ensure each CPU tracks the correct value for this clamp group */
> +     if (likely(uc_map_new.se_count > 1))
> +             goto done;
> +     for_each_possible_cpu(cpu) {
> +             struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
> +
> +             /* Refcounting is expected to be always 0 for free groups */
> +             if (unlikely(uc_cpu->group[clamp_id][group_id].tasks)) {
> +                     uc_cpu->group[clamp_id][group_id].tasks = 0;
> +#ifdef CONFIG_SCHED_DEBUG
> +                     WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
> +                          cpu, clamp_id, group_id);
> +#endif
> +             }
> +
> +             if (uc_cpu->group[clamp_id][group_id].value == clamp_value)
> +                     continue;
> +             uc_cpu->group[clamp_id][group_id].value = clamp_value;
> +     }
> +
> +done:
> +
>       /* Update SE's clamp values and attach it to new clamp group */
>       uc_se->value = clamp_value;
>       uc_se->group_id = group_id;
> @@ -948,6 +1124,7 @@ static void uclamp_fork(struct task_struct *p, bool reset)
>                       clamp_value = uclamp_none(clamp_id);
>  
>               p->uclamp[clamp_id].mapped = false;
> +             p->uclamp[clamp_id].active = false;
>               uclamp_group_get(&p->uclamp[clamp_id], clamp_id, clamp_value);
>       }
>  }
> @@ -959,9 +1136,13 @@ static void __init init_uclamp(void)
>  {
>       struct uclamp_se *uc_se;
>       unsigned int clamp_id;
> +     int cpu;
>  
>       mutex_init(&uclamp_mutex);
>  
> +     for_each_possible_cpu(cpu)
> +             memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_cpu));
> +
>       memset(uclamp_maps, 0, sizeof(uclamp_maps));
>       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
>               uc_se = &init_task.uclamp[clamp_id];
> @@ -970,6 +1151,8 @@ static void __init init_uclamp(void)
>  }
>  
>  #else /* CONFIG_UCLAMP_TASK */
> +static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
> +static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
>  static inline int __setscheduler_uclamp(struct task_struct *p,
>                                       const struct sched_attr *attr)
>  {
> @@ -987,6 +1170,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>       if (!(flags & ENQUEUE_RESTORE))
>               sched_info_queued(rq, p);
>  
> +     uclamp_cpu_get(rq, p);
>       p->sched_class->enqueue_task(rq, p, flags);
>  }
>  
> @@ -998,6 +1182,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>       if (!(flags & DEQUEUE_SAVE))
>               sched_info_dequeued(rq, p);
>  
> +     uclamp_cpu_put(rq, p);
>       p->sched_class->dequeue_task(rq, p, flags);
>  }
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 947ab14d3d5b..1755c9c9f4f0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -766,6 +766,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
>  #endif
>  #endif /* CONFIG_SMP */
>  
> +#ifdef CONFIG_UCLAMP_TASK
> +/**
> + * struct uclamp_group - Utilization clamp Group
> + * @value: utilization clamp value for tasks on this clamp group
> + * @tasks: number of RUNNABLE tasks on this clamp group
> + *
> + * Keep track of how many tasks are RUNNABLE for a given utilization
> + * clamp value.
> + */
> +struct uclamp_group {
> +     unsigned long value : SCHED_CAPACITY_SHIFT + 1;
> +     unsigned long tasks : BITS_PER_LONG - SCHED_CAPACITY_SHIFT - 1;
> +};
> +
> +/**
> + * struct uclamp_cpu - CPU's utilization clamp
> + * @value: currently active clamp values for a CPU
> + * @group: utilization clamp groups affecting a CPU
> + *
> + * Keep track of RUNNABLE tasks on a CPU to aggregate their clamp values.
> + * A clamp value affects a CPU when there is at least one task RUNNABLE
> + * (or actually running) with that value.
> + *
> + * We have up to UCLAMP_CNT possible different clamp values, which are
> + * currently only two: minimum utilization and maximum utilization.
> + *
> + * All utilization clamping values are MAX aggregated, since:
> + * - for util_min: we want to run the CPU at least at the max of the minimum
> + *   utilization required by its currently RUNNABLE tasks.
> + * - for util_max: we want to allow the CPU to run up to the max of the
> + *   maximum utilization allowed by its currently RUNNABLE tasks.
> + *
> + * Since on each system we expect only a limited number of different
> + * utilization clamp values (CONFIG_UCLAMP_GROUPS_COUNT), we use a simple
> + * array to track the metrics required to compute all the per-CPU utilization
> + * clamp values. The additional slot is used to track the default clamp
> + * values, i.e. no min/max clamping at all.
> + */
> +struct uclamp_cpu {
> +     struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
> +     int value[UCLAMP_CNT];
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
>  /*
>   * This is the main, per-CPU runqueue data structure.
>   *
> @@ -804,6 +848,11 @@ struct rq {
>       unsigned long           nr_load_updates;
>       u64                     nr_switches;
>  
> +#ifdef CONFIG_UCLAMP_TASK
> +     /* Utilization clamp values based on CPU's RUNNABLE tasks */
> +     struct uclamp_cpu       uclamp ____cacheline_aligned;
> +#endif
> +
>       struct cfs_rq           cfs;
>       struct rt_rq            rt;
>       struct dl_rq            dl;
> -- 
> 2.18.0
> 

-- 
#include <best/regards.h>

Patrick Bellasi

Re: [PATCH v5 04/15] sched/core: uclamp: add CPU's clamp groups accounting
