This is a slightly older version, posted by mistake along with the correct one. Please comment on:
Message-ID: <20181029183311.29175-6-patrick.bell...@arm.com>

Sorry for the noise.

On 29-Oct 18:32, Patrick Bellasi wrote:
> Utilization clamping allows clamping the utilization of a CPU within a
> [util_min, util_max] range which depends on the set of currently
> RUNNABLE tasks on that CPU.
> Each task references two "clamp groups" defining the minimum and maximum
> utilization clamp values to be considered for that task. These clamp
> values are mapped by a clamp group which is enforced on a CPU only when
> there is at least one RUNNABLE task referencing that clamp group.
>
> When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups
> active on that CPU can change. Since each clamp group enforces a
> different utilization clamp value, once the set of these groups changes
> it's required to re-compute what is the new "aggregated" clamp value to
> apply on that CPU.
>
> Clamp values are always MAX aggregated for both util_min and util_max.
> This is to ensure that no tasks can affect the performance of other
> co-scheduled tasks which are either more boosted (i.e. with higher
> util_min clamp) or less capped (i.e. with higher util_max clamp).
>
> Here we introduce the required support to properly reference count clamp
> groups at each task enqueue/dequeue time.
>
> Tasks have a:
>    task_struct::uclamp::group_id[clamp_idx]
> indexing, for each clamp index (i.e. util_{min,max}), the clamp group
> they have to refcount at enqueue time.
>
> CPUs rq have a:
>    rq::uclamp::group[clamp_idx][group_idx].tasks
> which is used to reference count how many tasks are currently RUNNABLE on
> that CPU for each clamp group of each clamp index.
>
> The clamp value of each clamp group is tracked by
>    rq::uclamp::group[][].value
> thus making rq::uclamp::group[][] an unordered array of clamp values.
> However, the MAX aggregation of the currently active clamp groups is
> implemented to minimize the number of times we need to scan the complete
> (unordered) clamp group array to figure out the new max value. This
> operation indeed happens only when we dequeue the last task of the clamp
> group corresponding to the current max clamp, and thus the CPU is either
> entering IDLE or going to schedule a less boosted or more clamped task.
> Moreover, the expected number of different clamp values, which can be
> configured at build time, is usually so small that a more advanced
> ordering algorithm is not needed. In real use-cases we expect fewer than
> 10 different clamp values for each clamp index.
>
> Signed-off-by: Patrick Bellasi <patrick.bell...@arm.com>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Paul Turner <p...@google.com>
> Cc: Suren Baghdasaryan <sur...@google.com>
> Cc: Todd Kjos <tk...@google.com>
> Cc: Joel Fernandes <joe...@google.com>
> Cc: Juri Lelli <juri.le...@redhat.com>
> Cc: Quentin Perret <quentin.per...@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggem...@arm.com>
> Cc: Morten Rasmussen <morten.rasmus...@arm.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux...@vger.kernel.org
>
> ---
> Changes in v5:
>  Message-ID: <20180914134128.GP1413@e110439-lin>
>  - remove not required check for (group_id == UCLAMP_NOT_VALID)
>    in uclamp_cpu_put_id
>  Message-ID: <20180912174456.GJ1413@e110439-lin>
>  - use bitfields to compress uclamp_group
>  Others:
>  - consistently use "unsigned int" for both clamp_id and group_id
>  - fixup documentation
>  - reduced usage of inline comments
>  - rebased on v4.19.0
>
> Changes in v4:
>  Message-ID: <20180816133249.GA2964@e110439-lin>
>  - keep the WARN in uclamp_cpu_put_id() but beautify a bit that code
>  - add another WARN on the unexpected condition of releasing a refcount
>    from a CPU which has a lower clamp value active
>  Other:
>  - ensure (and check) that all tasks have a valid group_id at
>    uclamp_cpu_get_id()
>  - rework uclamp_cpu layout to better fit into just 2x64B cache lines
>  - fix some s/SCHED_DEBUG/CONFIG_SCHED_DEBUG/
>  - rebased on v4.19-rc1
> Changes in v3:
>  Message-ID: <CAJuCfpF6=L=0lrmnnjrtnpazt4dwkqnv+thhn0dwpkcguzs...@mail.gmail.com>
>  - add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
>  - rename UCLAMP_NONE into UCLAMP_NOT_VALID
>  Message-ID: <CAJuCfpGaKvxKcO=RLcmveHRB9qbMrvFs2yFVrk=k-v_m7jk...@mail.gmail.com>
>  - few typos fixed
>  Other:
>  - rebased on tip/sched/core
> Changes in v2:
>  Message-ID: <20180413093822.gm4...@hirez.programming.kicks-ass.net>
>  - refactored struct rq::uclamp_cpu to be more cache efficient
>    no more holes, re-arranged vectors to match cache lines with expected
>    data locality
>  Message-ID: <20180413094615.gt4...@hirez.programming.kicks-ass.net>
>  - use *rq as parameter whenever already available
>  - add scheduling class's uclamp_enabled marker
>  - get rid of the "confusing" single callback uclamp_task_update()
>    and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
>  - fix/remove "bad" comments
>  Message-ID: <20180413113337.GU14248@e110439-lin>
>  - remove inline from init_uclamp, flag it __init
>  Other:
>  - rebased on v4.18-rc4
>  - improved documentation to make more explicit some concepts.
> ---
>  include/linux/sched.h |   5 ++
>  kernel/sched/core.c   | 185 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h  |  49 +++++++++++
>  3 files changed, 239 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index facace271ea1..3ab1cbd4e3b1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -604,11 +604,16 @@ struct sched_dl_entity {
>   * The mapped bit is set whenever a task has been mapped on a clamp group for
>   * the first time. When this bit is set, any clamp group get (for a new clamp
>   * value) will be matches by a clamp group put (for the old clamp value).
> + *
> + * The active bit is set whenever a task has got an effective clamp group
> + * and value assigned, which can be different from the user requested ones.
> + * This allows to know a task is actually refcounting a CPU's clamp group.
>   */
>  struct uclamp_se {
>  	unsigned int value : SCHED_CAPACITY_SHIFT + 1;
>  	unsigned int group_id : order_base_2(UCLAMP_GROUPS);
>  	unsigned int mapped : 1;
> +	unsigned int active : 1;
>  };
>  #endif /* CONFIG_UCLAMP_TASK */
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 654327d7f212..a98a96a7d9f1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -783,6 +783,159 @@ union uclamp_map {
>   */
>  static union uclamp_map uclamp_maps[UCLAMP_CNT][UCLAMP_GROUPS];
>
> +/**
> + * uclamp_cpu_update: updates the utilization clamp of a CPU
> + * @rq: the CPU's rq which utilization clamp has to be updated
> + * @clamp_id: the clamp index to update
> + *
> + * When tasks are enqueued/dequeued on/from a CPU, the set of currently active
> + * clamp groups can change. Since each clamp group enforces a different
> + * utilization clamp value, once the set of active groups changes it can be
> + * required to re-compute what is the new clamp value to apply for that CPU.
> + *
> + * For the specified clamp index, this method computes the new CPU utilization
> + * clamp to use until the next change on the set of active clamp groups.
> + */
> +static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
> +{
> +	unsigned int group_id;
> +	int max_value = 0;
> +
> +	for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
> +		if (!rq->uclamp.group[clamp_id][group_id].tasks)
> +			continue;
> +		/* Both min and max clamps are MAX aggregated */
> +		if (max_value < rq->uclamp.group[clamp_id][group_id].value)
> +			max_value = rq->uclamp.group[clamp_id][group_id].value;
> +		if (max_value >= SCHED_CAPACITY_SCALE)
> +			break;
> +	}
> +	rq->uclamp.value[clamp_id] = max_value;
> +}
> +
> +/**
> + * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
> + * @p: the task being enqueued on a CPU
> + * @rq: the CPU's rq where the clamp group has to be reference counted
> + * @clamp_id: the clamp index to update
> + *
> + * Once a task is enqueued on a CPU's rq, the clamp group currently defined by
> + * the task's uclamp::group_id is reference counted on that CPU.
> + */
> +static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
> +				     unsigned int clamp_id)
> +{
> +	unsigned int group_id;
> +
> +	if (unlikely(!p->uclamp[clamp_id].mapped))
> +		return;
> +
> +	group_id = p->uclamp[clamp_id].group_id;
> +	p->uclamp[clamp_id].active = true;
> +
> +	rq->uclamp.group[clamp_id][group_id].tasks += 1;
> +
> +	if (rq->uclamp.value[clamp_id] < p->uclamp[clamp_id].value)
> +		rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
> +}
> +
> +/**
> + * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
> + * @p: the task being dequeued from a CPU
> + * @rq: the CPU's rq from where the clamp group has to be released
> + * @clamp_id: the clamp index to update
> + *
> + * When a task is dequeued from a CPU's rq, the CPU's clamp group reference
> + * counted by the task is released.
> + * If this was the last task reference counting the current max clamp group,
> + * then the CPU clamping is updated to find the new max for the specified
> + * clamp index.
> + */
> +static inline void uclamp_cpu_put_id(struct task_struct *p, struct rq *rq,
> +				     unsigned int clamp_id)
> +{
> +	unsigned int clamp_value;
> +	unsigned int group_id;
> +
> +	if (unlikely(!p->uclamp[clamp_id].mapped))
> +		return;
> +
> +	group_id = p->uclamp[clamp_id].group_id;
> +	p->uclamp[clamp_id].active = false;
> +
> +	if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
> +		rq->uclamp.group[clamp_id][group_id].tasks -= 1;
> +#ifdef CONFIG_SCHED_DEBUG
> +	else {
> +		WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
> +		     cpu_of(rq), clamp_id, group_id);
> +	}
> +#endif
> +
> +	if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
> +		return;
> +
> +	clamp_value = rq->uclamp.group[clamp_id][group_id].value;
> +#ifdef CONFIG_SCHED_DEBUG
> +	if (unlikely(clamp_value > rq->uclamp.value[clamp_id])) {
> +		WARN(1, "invalid CPU[%d] clamp group [%u:%u] value\n",
> +		     cpu_of(rq), clamp_id, group_id);
> +	}
> +#endif
> +	if (clamp_value >= rq->uclamp.value[clamp_id])
> +		uclamp_cpu_update(rq, clamp_id);
> +}
> +
> +/**
> + * uclamp_cpu_get(): increase CPU's clamp group refcount
> + * @rq: the CPU's rq where the task is enqueued
> + * @p: the task being enqueued
> + *
> + * When a task is enqueued on a CPU's rq, all the clamp groups currently
> + * enforced on a task are reference counted on that rq. Since not all
> + * scheduling classes have utilization clamping support, their tasks will
> + * be silently ignored.
> + *
> + * This method updates the utilization clamp constraints considering the
> + * requirements for the specified task. Thus, this update must be done before
> + * calling into the scheduling classes, which will eventually update schedutil
> + * considering the new task requirements.
> + */
> +static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p)
> +{
> +	unsigned int clamp_id;
> +
> +	if (unlikely(!p->sched_class->uclamp_enabled))
> +		return;
> +
> +	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> +		uclamp_cpu_get_id(p, rq, clamp_id);
> +}
> +
> +/**
> + * uclamp_cpu_put(): decrease CPU's clamp group refcount
> + * @rq: the CPU's rq from where the task is dequeued
> + * @p: the task being dequeued
> + *
> + * When a task is dequeued from a CPU's rq, all the clamp groups the task has
> + * reference counted at enqueue time are now released.
> + *
> + * This method updates the utilization clamp constraints considering the
> + * requirements for the specified task. Thus, this update must be done before
> + * calling into the scheduling classes, which will eventually update schedutil
> + * considering the new task requirements.
> + */
> +static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
> +{
> +	unsigned int clamp_id;
> +
> +	if (unlikely(!p->sched_class->uclamp_enabled))
> +		return;
> +
> +	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> +		uclamp_cpu_put_id(p, rq, clamp_id);
> +}
> +
>  /**
>   * uclamp_group_put: decrease the reference count for a clamp group
>   * @clamp_id: the clamp index which was affected by a task group
> @@ -836,6 +989,7 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
>  	unsigned int free_group_id;
>  	unsigned int group_id;
>  	unsigned long res;
> +	int cpu;
>
>  retry:
>
> @@ -866,6 +1020,28 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
>  	if (res != uc_map_old.data)
>  		goto retry;
>
> +	/* Ensure each CPU tracks the correct value for this clamp group */
> +	if (likely(uc_map_new.se_count > 1))
> +		goto done;
> +	for_each_possible_cpu(cpu) {
> +		struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
> +
> +		/* Refcounting is expected to be always 0 for free groups */
> +		if (unlikely(uc_cpu->group[clamp_id][group_id].tasks)) {
> +			uc_cpu->group[clamp_id][group_id].tasks = 0;
> +#ifdef CONFIG_SCHED_DEBUG
> +			WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
> +			     cpu, clamp_id, group_id);
> +#endif
> +		}
> +
> +		if (uc_cpu->group[clamp_id][group_id].value == clamp_value)
> +			continue;
> +		uc_cpu->group[clamp_id][group_id].value = clamp_value;
> +	}
> +
> +done:
> +
>  	/* Update SE's clamp values and attach it to new clamp group */
>  	uc_se->value = clamp_value;
>  	uc_se->group_id = group_id;
> @@ -948,6 +1124,7 @@ static void uclamp_fork(struct task_struct *p, bool reset)
>  		clamp_value = uclamp_none(clamp_id);
>
>  		p->uclamp[clamp_id].mapped = false;
> +		p->uclamp[clamp_id].active = false;
>  		uclamp_group_get(&p->uclamp[clamp_id], clamp_id, clamp_value);
>  	}
>  }
> @@ -959,9 +1136,13 @@ static void __init init_uclamp(void)
>  {
>  	struct uclamp_se *uc_se;
>  	unsigned int clamp_id;
> +	int cpu;
>
>  	mutex_init(&uclamp_mutex);
>
> +	for_each_possible_cpu(cpu)
> +		memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_cpu));
> +
>  	memset(uclamp_maps, 0, sizeof(uclamp_maps));
>  	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
>  		uc_se = &init_task.uclamp[clamp_id];
> @@ -970,6 +1151,8 @@ static void __init init_uclamp(void)
>  }
>
>  #else /* CONFIG_UCLAMP_TASK */
> +static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
> +static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
>  static inline int __setscheduler_uclamp(struct task_struct *p,
>  					const struct sched_attr *attr)
>  {
> @@ -987,6 +1170,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>  	if (!(flags & ENQUEUE_RESTORE))
>  		sched_info_queued(rq, p);
>
> +	uclamp_cpu_get(rq, p);
>  	p->sched_class->enqueue_task(rq, p, flags);
>  }
>
> @@ -998,6 +1182,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  	if (!(flags & DEQUEUE_SAVE))
>  		sched_info_dequeued(rq, p);
>
> +	uclamp_cpu_put(rq, p);
>  	p->sched_class->dequeue_task(rq, p, flags);
>  }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 947ab14d3d5b..1755c9c9f4f0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -766,6 +766,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
>  #endif
>  #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/**
> + * struct uclamp_group - Utilization clamp Group
> + * @value: utilization clamp value for tasks on this clamp group
> + * @tasks: number of RUNNABLE tasks on this clamp group
> + *
> + * Keep track of how many tasks are RUNNABLE for a given utilization
> + * clamp value.
> + */
> +struct uclamp_group {
> +	unsigned long value : SCHED_CAPACITY_SHIFT + 1;
> +	unsigned long tasks : BITS_PER_LONG - SCHED_CAPACITY_SHIFT - 1;
> +};
> +
> +/**
> + * struct uclamp_cpu - CPU's utilization clamp
> + * @value: currently active clamp values for a CPU
> + * @group: utilization clamp groups affecting a CPU
> + *
> + * Keep track of RUNNABLE tasks on a CPU to aggregate their clamp values.
> + * A clamp value is affecting a CPU where there is at least one task RUNNABLE
> + * (or actually running) with that value.
> + *
> + * We have up to UCLAMP_CNT possible different clamp values, which are
> + * currently only two: minimum utilization and maximum utilization.
> + *
> + * All utilization clamping values are MAX aggregated, since:
> + * - for util_min: we want to run the CPU at least at the max of the minimum
> + *   utilization required by its currently RUNNABLE tasks.
> + * - for util_max: we want to allow the CPU to run up to the max of the
> + *   maximum utilization allowed by its currently RUNNABLE tasks.
> + *
> + * Since on each system we expect only a limited number of different
> + * utilization clamp values (CONFIG_UCLAMP_GROUPS_COUNT), we use a simple
> + * array to track the metrics required to compute all the per-CPU utilization
> + * clamp values. The additional slot is used to track the default clamp
> + * values, i.e. no min/max clamping at all.
> + */
> +struct uclamp_cpu {
> +	struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
> +	int value[UCLAMP_CNT];
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
>  /*
>   * This is the main, per-CPU runqueue data structure.
>   *
> @@ -804,6 +848,11 @@ struct rq {
>  	unsigned long nr_load_updates;
>  	u64 nr_switches;
>
> +#ifdef CONFIG_UCLAMP_TASK
> +	/* Utilization clamp values based on CPU's RUNNABLE tasks */
> +	struct uclamp_cpu uclamp ____cacheline_aligned;
> +#endif
> +
>  	struct cfs_rq cfs;
>  	struct rt_rq rt;
>  	struct dl_rq dl;
> --
> 2.18.0
>

--
#include <best/regards.h>

Patrick Bellasi
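
For anyone skimming the quoted patch, the per-CPU refcounting and the lazy MAX aggregation it describes can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the kernel code: the sketch_* names, the group count of 4 and the use of a single clamp index are hypothetical simplifications, and locking, the mapped/active bits and the WARN paths are left out (asserts stand in for them).

```c
#include <assert.h>

/* Hypothetical user-space stand-ins for the kernel constants. */
#define SKETCH_GROUPS	4	/* plays the role of UCLAMP_GROUPS */
#define SKETCH_SCALE	1024	/* plays the role of SCHED_CAPACITY_SCALE */

struct sketch_group {
	unsigned int value;	/* clamp value tracked by this group */
	unsigned int tasks;	/* RUNNABLE tasks refcounting this group */
};

struct sketch_cpu {
	struct sketch_group group[SKETCH_GROUPS];
	unsigned int value;	/* current MAX-aggregated clamp */
};

/* Full scan of the unordered array: only needed when the max group empties. */
static void sketch_cpu_update(struct sketch_cpu *c)
{
	unsigned int id, max_value = 0;

	for (id = 0; id < SKETCH_GROUPS; ++id) {
		if (!c->group[id].tasks)
			continue;
		if (max_value < c->group[id].value)
			max_value = c->group[id].value;
		if (max_value >= SKETCH_SCALE)
			break;
	}
	c->value = max_value;
}

/* Enqueue: take a reference and raise the max if needed, without a scan. */
static void sketch_cpu_get(struct sketch_cpu *c, unsigned int id)
{
	assert(id < SKETCH_GROUPS);
	c->group[id].tasks += 1;
	if (c->value < c->group[id].value)
		c->value = c->group[id].value;
}

/* Dequeue: drop a reference; rescan only if the max group went idle. */
static void sketch_cpu_put(struct sketch_cpu *c, unsigned int id)
{
	assert(id < SKETCH_GROUPS);
	assert(c->group[id].tasks > 0);
	c->group[id].tasks -= 1;
	if (c->group[id].tasks)
		return;
	if (c->group[id].value >= c->value)
		sketch_cpu_update(c);
}
```

With groups holding values 200 and 512, enqueueing one task on each raises the aggregated clamp to 512; dequeueing the 200 task leaves it untouched with no scan, and only dequeueing the 512 task triggers the full rescan.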