Hi Vincent,

Thank you for your review.

On 11/15/2012 11:43 PM, Vincent Guittot wrote:
> Hi Preeti,
>
> On 15 November 2012 17:54, Preeti U Murthy <pre...@linux.vnet.ibm.com> wrote:
>> Currently the load balancer weighs a task based upon its priority, and this
>> weight consequently gets added to the weight of the run queue that it is
>> on. It is this weight of the run queue that sums up to a sched group's
>> load, which is used to decide the busiest or the idlest group and the run
>> queue thereof.
>>
>> The Per-Entity-Load-Tracking metric, however, measures how long a task has
>> been runnable over the duration of its lifetime. This gives us a hint of
>> the amount of CPU time that the task can demand. This metric takes care of
>> the task priority as well. Therefore, apart from the priority of a task, we
>> also have an idea of its live behavior. This seems to be a more realistic
>> metric to use to compute the task weight, which adds up to the run queue
>> weight and the weight of the sched group. Consequently they can be used for
>> load balancing.
>>
>> The semantics of load balancing are left untouched. The two functions
>> load_balance() and select_task_rq_fair() perform the task of load
>> balancing. These two paths have been gone through in this patch to make the
>> necessary changes.
>>
>> weighted_cpuload() and task_h_load() provide the run queue weight and the
>> weight of the task respectively. They have been modified to provide the
>> Per-Entity-Load-Tracking metric as relevant for each.
>> The rest of the modifications had to be made to suit these two changes.
>>
>> The Completely Fair Scheduler class is the only sched_class which
>> contributes to the run queue load. Therefore rq->load.weight ==
>> cfs_rq->load.weight when the cfs_rq is the root cfs_rq (rq->cfs) of the
>> hierarchy. When replacing this with the Per-Entity-Load-Tracking metric,
>> cfs_rq->runnable_load_avg needs to be used, as this is the right reflection
>> of the run queue load when the cfs_rq is the root cfs_rq (rq->cfs) of the
>> hierarchy. This metric reflects the percentage uptime of the tasks that are
>> queued on it and hence that contribute to the load. Thus
>> cfs_rq->runnable_load_avg replaces the metric earlier used in
>> weighted_cpuload().
>>
>> The task load is aptly captured by se.avg.load_avg_contrib, which captures
>> the runnable time vs the alive time of the task against its priority. This
>> metric replaces the earlier metric used in task_h_load().
>>
>> The consequent changes appear as data type changes for the helper
>> variables; they abound in number, because cfs_rq->runnable_load_avg needs
>> to be big enough to capture the tasks' load often and accurately.
>
> You are now using cfs_rq->runnable_load_avg instead of
> cfs_rq->load.weight for the calculation of cpu_load, but
> cfs_rq->runnable_load_avg is smaller than or equal to the
> cfs_rq->load.weight value. This implies that the new value is smaller than
> or equal to the old statistic, so you should be able to keep the same
> variable width for the computation of cpu_load.
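For reference, the declarations being compared look roughly like this in the
per-entity load-tracking code (a sketch only, unrelated members omitted;
u32/u64 are the usual kernel typedefs):

    struct load_weight {
            unsigned long weight;   /* what weighted_cpuload() returned so far */
            /* ... */
    };

    struct cfs_rq {
            struct load_weight load;        /* load.weight: unsigned long */
            /* ... */
            u64 runnable_load_avg;          /* the new metric is declared 64 bits wide */
            u64 blocked_load_avg;
            /* ... */
    };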
Right. But cfs_rq->runnable_load_avg is a 64-bit unsigned integer as per the
per-entity load-tracking patchset. I could not figure out why this is the
case although, as you mention, its value will not exceed cfs_rq->load.weight.
In order to retain the data type of cfs_rq->runnable_load_avg as it is, these
changes had to be made to suit it. It would be good if someone could clarify
why it is a 64-bit integer; it would save a lot of trouble if we could treat
it as having the same width as cfs_rq->load.weight. Ben, Paul, can you
clarify this point?

>
>>
>> The following patch does not consider CONFIG_FAIR_GROUP_SCHED and
>> CONFIG_SCHED_NUMA. This is done so as to evaluate this approach starting
>> from the simplest scenario. Earlier discussions can be found in the link
>> below.
>>
>> Link: https://lkml.org/lkml/2012/10/25/162
>> Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
>> ---
>>  include/linux/sched.h |    2 +-
>>  kernel/sched/core.c   |   12 +++++----
>>  kernel/sched/fair.c   |   64 +++++++++++++++++++++++++------------------------
>>  kernel/sched/sched.h  |    2 +-
>>  4 files changed, 40 insertions(+), 40 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 087dd20..302756e 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -924,7 +924,7 @@ struct sched_domain {
>>         unsigned int lb_count[CPU_MAX_IDLE_TYPES];
>>         unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
>>         unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
>> -       unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
>> +       u64 lb_imbalance[CPU_MAX_IDLE_TYPES];
>>         unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
>>         unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
>>         unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 24d8b9b..4dea057 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2415,8 +2415,8 @@ static const unsigned char
>>   * would be when CPU is idle and so we just decay the old load without
>>   * adding any new load.
>>   */
>> -static unsigned long
>> -decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
>> +static u64
>> +decay_load_missed(u64 load, unsigned long missed_updates, int idx)
>>  {
>>         int j = 0;
>>
>> @@ -2444,7 +2444,7 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
>>   * scheduler tick (TICK_NSEC). With tickless idle this will not be called
>>   * every tick. We fix it up based on jiffies.
>>   */
>> -static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>> +static void __update_cpu_load(struct rq *this_rq, u64 this_load,
>>                               unsigned long pending_updates)
>>  {
>>         int i, scale;
>>
>> @@ -2454,7 +2454,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>>         /* Update our load: */
>>         this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */
>>         for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
>> -               unsigned long old_load, new_load;
>> +               u64 old_load, new_load;
>>
>>                 /* scale is effectively 1 << i now, and >> i divides by scale */
>>
>> @@ -2496,7 +2496,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>>  void update_idle_cpu_load(struct rq *this_rq)
>>  {
>>         unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
>> -       unsigned long load = this_rq->load.weight;
>> +       u64 load = this_rq->cfs.runnable_load_avg;
>>         unsigned long pending_updates;
>>
>>         /*
>> @@ -2546,7 +2546,7 @@ static void update_cpu_load_active(struct rq *this_rq)
>>          * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
>>          */
>>         this_rq->last_load_update_tick = jiffies;
>> -       __update_cpu_load(this_rq, this_rq->load.weight, 1);
>> +       __update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);
>>
>>         calc_load_account_active(this_rq);
>>  }
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a9cdc8f..f8f3a29 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2935,9 +2935,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>
>>  #ifdef CONFIG_SMP
>>  /* Used instead of source_load when we know the type == 0 */
>> -static unsigned long weighted_cpuload(const int cpu)
>> +static u64 weighted_cpuload(const int cpu)
>>  {
>> -       return cpu_rq(cpu)->load.weight;
>> +       return cpu_rq(cpu)->cfs.runnable_load_avg;
>>  }
>>
>>  /*
>> @@ -2947,10 +2947,10 @@ static unsigned long weighted_cpuload(const int cpu)
>>   * We want to under-estimate the load of migration sources, to
>>   * balance conservatively.
>>   */
>> -static unsigned long source_load(int cpu, int type)
>> +static u64 source_load(int cpu, int type)
>>  {
>>         struct rq *rq = cpu_rq(cpu);
>> -       unsigned long total = weighted_cpuload(cpu);
>> +       u64 total = weighted_cpuload(cpu);
>>
>>         if (type == 0 || !sched_feat(LB_BIAS))
>>                 return total;
>> @@ -2962,10 +2962,10 @@ static unsigned long source_load(int cpu, int type)
>>   * Return a high guess at the load of a migration-target cpu weighted
>>   * according to the scheduling class and "nice" value.
>>   */
>> -static unsigned long target_load(int cpu, int type)
>> +static u64 target_load(int cpu, int type)
>>  {
>>         struct rq *rq = cpu_rq(cpu);
>> -       unsigned long total = weighted_cpuload(cpu);
>> +       u64 total = weighted_cpuload(cpu);
>>
>>         if (type == 0 || !sched_feat(LB_BIAS))
>>                 return total;
>> @@ -2978,13 +2978,13 @@ static unsigned long power_of(int cpu)
>>         return cpu_rq(cpu)->cpu_power;
>>  }
>>
>> -static unsigned long cpu_avg_load_per_task(int cpu)
>> +static u64 cpu_avg_load_per_task(int cpu)
>>  {
>>         struct rq *rq = cpu_rq(cpu);
>>         unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
>>
>>         if (nr_running)
>> -               return rq->load.weight / nr_running;
>> +               return rq->cfs.runnable_load_avg / nr_running;
>
> You now need to use div_u64 for all division of a 64bits variable like
> runnable_load_avg otherwise it can't compile on 32bits platform like
> ARM. This one is obvious because it appears in your patch but other
> division could be now 64bits division
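For illustration only (this is not part of the posted patch), the
cpu_avg_load_per_task() hunk above is the kind of place that would need a
32-bit-safe division, e.g. with the generic div_u64() helper from
<linux/math64.h>:

    /* Sketch: divide the u64 metric with div_u64() so this also
     * builds on 32-bit architectures. */
    static u64 cpu_avg_load_per_task(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            unsigned long nr_running = ACCESS_ONCE(rq->nr_running);

            if (nr_running)
                    return div_u64(rq->cfs.runnable_load_avg, nr_running);

            return 0;
    }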
Ah yes, there will be some trouble here; explicit do_div() calls need to be
inserted, and there will be plenty of such cases. But as I mentioned above,
once we are clear about why the width of cfs_rq->runnable_load_avg is 64 bit,
we can sort this out. We will need someone to clarify this. I do not see a
way around making the above changes if, for some reason, the width of
cfs_rq->runnable_load_avg has to be maintained as is. Any thoughts on this?

>
> Regards,
> Vincent

Regards
Preeti U Murthy