[PATCH 0/2] cpuhotplug/nohz: Fix issue of "negative" idle time
On most architectures (arm, mips, s390, sh and x86), the idle thread of a cpu
does not cleanly exit nohz state before dying upon hot-remove. As a result, an
offline cpu is seen to be in nohz mode (ts->idle_active = 1) and its offline
time can potentially be included in the total idle time reported via
/proc/stat. When the same cpu later comes online, its offline time however is
not included in its idle time statistics, thus causing a rollback in total
idle time to be observed by applications like top. Example output from the
Android top command highlighting this issue:

	User 232%, System 70%, IOW 46%, IRQ 1%
	User 1322 + Nice 0 + Sys 399 + Idle -1423 + IOW 264 + IRQ 0 + SIRQ 7 = 569

top is reporting the system to be idle for -1423 ticks over some sampling
period. This happens as the total idle time reported in the cpu line of
/proc/stat *dropped* from the last value observed (cached) by the top command.

While this was originally seen on an ARM platform running a 3.4 based kernel,
I could easily recreate it on my x86 desktop running the latest tip/master
kernel (HEAD 3a7bfcad): online/offline a cpu in a tight loop, and in another
loop read /proc/stat and observe whether total idle time drops from the
previously read value.

Although commit 7386cdbf (nohz: Fix idle ticks in cpu summary line of
/proc/stat) aims to avoid this bug, it's not preemption-proof. A thread could
get preempted after the cpu_online() check in get_idle_time(), thus
potentially leading to get_cpu_idle_time_us() being invoked on an offline cpu.
One potential fix is to serialize hotplug with the /proc/stat read operation
(via use of get/put_online_cpus()), which I disliked in favor of the other
solution proposed in this series.

In this patch series:

- Patch 1/2 modifies the idle loop on architectures arm, mips, s390, sh and
  x86 to exit nohz state before the associated idle thread dies upon
  hot-remove. This fixes the idle time accounting bug.
  Patch 1/2 also modifies the idle loop on all architectures supporting cpu
  hotplug to have the idle thread of a dying cpu die immediately after
  schedule() returns control to it. I see no point in wasting time via calls
  to *_enter()/*_exit() before noticing the need to die and dying.

- Patch 2/2 reverts commit 7386cdbf (nohz: Fix idle ticks in cpu summary line
  of /proc/stat). The cpu_online() check introduced by it is no longer
  necessary with Patch 1/2 applied. Having fewer code sites worry about the
  online status of cpus is a good thing!

---
 arch/arm/kernel/process.c      |  9 -
 arch/arm/kernel/smp.c          |  2 +-
 arch/blackfin/kernel/process.c |  8
 arch/mips/kernel/process.c     |  6 +++---
 arch/powerpc/kernel/idle.c     |  2 +-
 arch/s390/kernel/process.c     |  4 ++--
 arch/sh/kernel/idle.c          |  5 ++---
 arch/sparc/kernel/process_64.c |  3 ++-
 arch/x86/kernel/process.c      |  5 ++---
 fs/proc/stat.c                 | 14 --
 10 files changed, 25 insertions(+), 33 deletions(-)

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
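The rollback described in the cover letter can be observed from userspace by
sampling the cpu summary line of /proc/stat twice and comparing the idle
field. Below is a minimal illustrative sketch (not part of the patches; the
helper names are made up, and the parsing assumes the standard /proc/stat
column order of user, nice, system, idle):

```c
#include <stdio.h>

/* Parse the idle field (5th column) of a /proc/stat "cpu" summary line. */
static long long parse_idle_ticks(const char *line)
{
	long long user, nice, sys, idle;

	if (sscanf(line, "cpu %lld %lld %lld %lld",
		   &user, &nice, &sys, &idle) != 4)
		return -1;
	return idle;
}

/*
 * A sampler like top caches the previous idle value; a negative delta
 * between two consecutive samples is exactly the bug described above.
 */
static long long idle_delta(const char *prev_line, const char *cur_line)
{
	return parse_idle_ticks(cur_line) - parse_idle_ticks(prev_line);
}
```

Running such a checker in a loop while onlining/offlining a cpu (via
/sys/devices/system/cpu/cpuN/online) reproduces the negative idle delta on
affected kernels.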
[PATCH 1/2] cpuhotplug/nohz: Remove offline cpus from nohz-idle state
Modify the idle loop of the arm, mips, s390, sh and x86 architectures to exit
from nohz state before dying upon hot-remove. This change is needed to avoid
userspace tools like the top command seeing a rollback in total idle time over
some sampling periods.

Additionally, modify the idle loop on all architectures supporting cpu hotplug
to have the idle thread of a dying cpu die immediately after the scheduler
returns control to it. There is no point in wasting time via calls to
*_enter()/*_exit() before noticing the need to die and dying.

Additional ARM specific change: Revert commit ff081e05 ("ARM: 7457/1: smp: Fix
suspicious RCU originating from cpu_die()"), which added a RCU_NONIDLE()
wrapper around the call to complete(). That wrapper is no longer needed as
cpu_die() is now called outside of a rcu_idle_enter()/exit() section. I also
think that the wait_for_completion() based wait in ARM's __cpu_die() can be
replaced with a busy-loop based one, as the wait there in general should be
terminated within a few cycles.

Cc: Russell King
Cc: Paul E. McKenney
Cc: Stephen Boyd
Cc: linux-arm-ker...@lists.infradead.org
Cc: Mike Frysinger
Cc: uclinux-dist-de...@blackfin.uclinux.org
Cc: Ralf Baechle
Cc: linux-m...@linux-mips.org
Cc: Benjamin Herrenschmidt
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Martin Schwidefsky
Cc: linux-s...@vger.kernel.org
Cc: Paul Mundt
Cc: linux...@vger.kernel.org
Cc: "David S. Miller"
Cc: sparcli...@vger.kernel.org
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: x...@kernel.org
Cc: mho...@suse.cz
Cc: srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Srivatsa Vaddagiri
---
 arch/arm/kernel/process.c      | 9 -
 arch/arm/kernel/smp.c          | 2 +-
 arch/blackfin/kernel/process.c | 8
 arch/mips/kernel/process.c     | 6 +++---
 arch/powerpc/kernel/idle.c     | 2 +-
 arch/s390/kernel/process.c     | 4 ++--
 arch/sh/kernel/idle.c          | 5 ++---
 arch/sparc/kernel/process_64.c | 3 ++-
 arch/x86/kernel/process.c      | 5 ++---
 9 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index c6dec5f..254099b 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -191,11 +191,6 @@ void cpu_idle(void)
 		rcu_idle_enter();
 		ledtrig_cpu(CPU_LED_IDLE_START);
 		while (!need_resched()) {
-#ifdef CONFIG_HOTPLUG_CPU
-			if (cpu_is_offline(smp_processor_id()))
-				cpu_die();
-#endif
-
 			/*
 			 * We need to disable interrupts here
 			 * to ensure we don't miss a wakeup call.
@@ -224,6 +219,10 @@ void cpu_idle(void)
 		rcu_idle_exit();
 		tick_nohz_idle_exit();
 		schedule_preempt_disabled();
+#ifdef CONFIG_HOTPLUG_CPU
+		if (cpu_is_offline(smp_processor_id()))
+			cpu_die();
+#endif
 	}
 }

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 84f4cbf..a8e3b8a 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -251,7 +251,7 @@ void __ref cpu_die(void)
 	mb();
 
 	/* Tell __cpu_die() that this CPU is now safe to dispose of */
-	RCU_NONIDLE(complete(&cpu_died));
+	complete(&cpu_died);
 
 	/*
 	 * actual CPU shutdown procedure is at least platform (if not

diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 3e16ad9..2bee1af 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -83,10 +83,6 @@ void cpu_idle(void)
 	while (1) {
 		void (*idle)(void) = pm_idle;
 
-#ifdef CONFIG_HOTPLUG_CPU
-		if (cpu_is_offline(smp_processor_id()))
-			cpu_die();
-#endif
 		if (!idle)
 			idle = default_idle;
 		tick_nohz_idle_enter();
@@ -98,6 +94,10 @@ void cpu_idle(void)
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
+#ifdef CONFIG_HOTPLUG_CPU
+		if (cpu_is_offline(smp_processor_id()))
+			cpu_die();
+#endif
 	}
 }

diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index a11c6f9..41102a0 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -71,13 +71,13 @@ void __noreturn cpu_idle(void)
 			start_critical_timings();
 		}
 	}
+	rcu_idle_exit();
+	tick_nohz_idle_exit();
+	schedule_preempt_disabled();
 #ifdef CONFIG_HOTPLUG_CPU
 	if (!cpu_online(cpu) && !cpu_isset(cpu, cpu_callin_map))
 		play_dead();
 #endif
-	rcu_idle_exit();
-	tick_nohz_idle_exit();
-	schedule_preempt_disabled();
[PATCH 2/2] Revert "nohz: Fix idle ticks in cpu summary line of /proc/stat" (commit 7386cdbf2f57ea8cff3c9fde93f206e58b9fe13f).
With offline cpus no longer being seen in nohz mode (ts->idle_active = 0), we
don't need the check for cpu_online() introduced in commit 7386cdbf. An
offline cpu's idle time, as last recorded in its ts->idle_sleeptime, will be
reported (thus excluding its offline time from idle time statistics).

Cc: mho...@suse.cz
Cc: srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Srivatsa Vaddagiri
---
 fs/proc/stat.c | 14 --
 1 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..64c3b31 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -45,13 +45,10 @@ static cputime64_t get_iowait_time(int cpu)
 
 static u64 get_idle_time(int cpu)
 {
-	u64 idle, idle_time = -1ULL;
-
-	if (cpu_online(cpu))
-		idle_time = get_cpu_idle_time_us(cpu, NULL);
+	u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
 
 	if (idle_time == -1ULL)
-		/* !NO_HZ or cpu offline so we can rely on cpustat.idle */
+		/* !NO_HZ so we can rely on cpustat.idle */
 		idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
 	else
 		idle = usecs_to_cputime64(idle_time);
@@ -61,13 +58,10 @@ static u64 get_idle_time(int cpu)
 
 static u64 get_iowait_time(int cpu)
 {
-	u64 iowait, iowait_time = -1ULL;
-
-	if (cpu_online(cpu))
-		iowait_time = get_cpu_iowait_time_us(cpu, NULL);
+	u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
 
 	if (iowait_time == -1ULL)
-		/* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
+		/* !NO_HZ so we can rely on cpustat.iowait */
 		iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
 	else
 		iowait = usecs_to_cputime64(iowait_time);
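For reference, the fallback logic restored by this revert can be modeled in
isolation. The helper below is a hypothetical userspace stand-in (not kernel
code): get_cpu_idle_time_us() returns -1ULL when NO_HZ is not in effect, in
which case get_idle_time() falls back to the tick-based cpustat value; the
usecs_to_cputime64() conversion is omitted here.

```c
#include <stdint.h>

#define NOHZ_UNAVAILABLE ((uint64_t)-1)	/* the -1ULL sentinel */

/*
 * Hypothetical model of get_idle_time() after the revert: trust the
 * nohz counter when present, else fall back to cpustat.idle.
 */
static uint64_t pick_idle_time(uint64_t nohz_idle_us, uint64_t cpustat_idle)
{
	if (nohz_idle_us == NOHZ_UNAVAILABLE)
		return cpustat_idle;	/* !NO_HZ: rely on cpustat.idle */
	return nohz_idle_us;		/* usecs_to_cputime64() omitted */
}
```

With the unconditional call now safe on offline cpus, no cpu_online() branch
appears in this path at all.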
Re: [PATCH 2/2] Revert "nohz: Fix idle ticks in cpu summary line of /proc/stat" (commit 7386cdbf2f57ea8cff3c9fde93f206e58b9fe13f).
* Sergei Shtylyov [2013-01-04 16:13:42]:

> >With offline cpus no longer being seen in nohz mode (ts->idle_active=0), we
> >don't need the check for cpu_online() introduced in commit 7386cdbf. Offline
>
>    Please also specify the summary of that commit in parens (or
> however you like).

I had that in the Subject line, but yes, it would be good to include it in the
commit message as well. I will incorporate that change along with anything
else required in the next version of this patch.

- vatsa
Re: [PATCH 1/2] cpuhotplug/nohz: Remove offline cpus from nohz-idle state
* Russell King - ARM Linux [2013-01-05 10:36:27]:

> On Thu, Jan 03, 2013 at 06:58:38PM -0800, Srivatsa Vaddagiri wrote:
> > I also think that the wait_for_completion() based wait in ARM's
> > __cpu_die() can be replaced with a busy-loop based one, as the wait
> > there in general should be terminated within few cycles.
>
> Why open-code this stuff when we have infrastructure already in the kernel
> for waiting for stuff to happen? I chose to use the standard infrastructure
> because its better tested, and avoids having to think about whether we need
> CPU barriers and such like to ensure that updates are seen in a timely
> manner.

I was primarily thinking of calling as few generic functions as possible on a
dead cpu. I recall several "am I running on a dead cpu?" checks
(cpu_is_offline(this_cpu)) that were put in generic routines during early
versions of cpu hotplug [1] to educate code running on a dead cpu, the need
for which went away with the introduction of the atomic/stop-machine variant.
The need to add a RCU_NONIDLE() wrapper around ARM's cpu_die() [2] is perhaps
a more recent example of educating code running on a dead cpu. The more
quickly we die after the idle thread of the dying cpu gains control, the
better!

1. http://lwn.net/Articles/69040/
2. http://lists.infradead.org/pipermail/linux-arm-kernel/2012-July/107971.html

- vatsa

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation
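The busy-loop alternative under discussion could look roughly like the sketch
below. This is purely illustrative: the flag name and iteration cap are made
up, and real kernel code would use the existing cpu_died machinery plus
cpu_relax()/barriers rather than a plain volatile spin.

```c
/* Hypothetical busy-wait replacing wait_for_completion() in __cpu_die(). */
static volatile int cpu_died_flag;

/* Spin until the dying cpu raises the flag, or give up after max_iters. */
static int wait_for_cpu_died(unsigned long max_iters)
{
	while (!cpu_died_flag && max_iters--)
		;	/* cpu_relax() in real kernel code */
	return cpu_died_flag;
}
```

Russell's point stands against this design: the completion-based wait gets the
memory-ordering guarantees for free, which is why the series keeps it.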
Re: [PATCH 1/5] sched: fix capacity calculations for SMT4
On Mon, May 31, 2010 at 10:33:16AM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote:
> >
> > Hrmm, my brain seems muddled but I might have another solution, let me
> > ponder this for a bit..
>
> Right, so the thing I was thinking about is taking the group capacity
> into account when determining the capacity for a single cpu.

Peter,

We are exploring an alternate solution which seems to be working as expected.
Basically, allow a capacity of 1 for SMT threads provided there is no
significant influence by RT tasks or frequency scaling. Note that at the core
level, capacity is unchanged and hence this affects only how tasks are
distributed within a core.

Mike Neuling should post an updated patchset containing this patch (with more
comments added, of course!).

Signed-off-by: Srivatsa Vaddagiri
---
 include/linux/sched.h |  2 +-
 kernel/sched_fair.c   | 30 +++---
 2 files changed, 24 insertions(+), 8 deletions(-)

Index: linux-2.6-ozlabs/include/linux/sched.h
===================================================================
--- linux-2.6-ozlabs.orig/include/linux/sched.h
+++ linux-2.6-ozlabs/include/linux/sched.h
@@ -860,7 +860,7 @@ struct sched_group {
 	 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
 	 * single CPU.
 	 */
-	unsigned int cpu_power;
+	unsigned int cpu_power, cpu_power_orig;
 
 	/*
 	 * The CPUs this group covers.

Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -2285,13 +2285,6 @@ static void update_cpu_power(struct sche
 	unsigned long power = SCHED_LOAD_SCALE;
 	struct sched_group *sdg = sd->groups;
 
-	if (sched_feat(ARCH_POWER))
-		power *= arch_scale_freq_power(sd, cpu);
-	else
-		power *= default_scale_freq_power(sd, cpu);
-
-	power >>= SCHED_LOAD_SHIFT;
-
 	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
 		if (sched_feat(ARCH_POWER))
 			power *= arch_scale_smt_power(sd, cpu);
@@ -2301,6 +2294,15 @@ static void update_cpu_power(struct sche
 		power >>= SCHED_LOAD_SHIFT;
 	}
 
+	sdg->cpu_power_orig = power;
+
+	if (sched_feat(ARCH_POWER))
+		power *= arch_scale_freq_power(sd, cpu);
+	else
+		power *= default_scale_freq_power(sd, cpu);
+
+	power >>= SCHED_LOAD_SHIFT;
+
 	power *= scale_rt_power(cpu);
 	power >>= SCHED_LOAD_SHIFT;
 
@@ -2333,6 +2335,22 @@ static void update_group_power(struct sc
 	sdg->cpu_power = power;
 }
 
+static inline int
+rt_freq_influence(struct sched_group *group, struct sched_domain *sd)
+{
+	if (sd->child)
+		return 1;
+
+	/*
+	 * Check to see if the final cpu power was reduced by more
+	 * than 10% by frequency or rt tasks
+	 */
+	if (group->cpu_power * 100 < group->cpu_power_orig * 90)
+		return 1;
+
+	return 0;
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @sd: The sched_domain whose statistics are to be updated.
@@ -2426,6 +2444,8 @@ static inline void update_sg_lb_stats(st
 	sgs->group_capacity =
 		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+	if (!sgs->group_capacity && !rt_freq_influence(group, sd))
+		sgs->group_capacity = 1;
 }
 
 /**
@@ -2725,7 +2745,8 @@ ret:
  */
 static struct rq *
 find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
-		   unsigned long imbalance, const struct cpumask *cpus)
+		   unsigned long imbalance, const struct cpumask *cpus,
+		   struct sched_domain *sd)
 {
 	struct rq *busiest = NULL, *rq;
 	unsigned long max_load = 0;
@@ -2736,6 +2757,9 @@ find_busiest_queue(struct sched_group *g
 		unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
 		unsigned long wl;
 
+		if (!capacity && !rt_freq_influence(group, sd))
+			capacity = 1;
+
 		if (!cpumask_test_cpu(i, cpus))
 			continue;
 
@@ -2852,7 +2876,7 @@ redo:
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, idle, imbalance, cpus);
+	busiest = find_busiest_queue(group, idle, imbalance, cpus, sd);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
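The effect of the patch's 10% threshold on a sub-unit SMT capacity can be
checked in isolation. The sketch below is a userspace model, not the kernel
code itself: it reimplements DIV_ROUND_CLOSEST(cpu_power, SCHED_LOAD_SCALE)
together with the rt_freq_influence() fixup, and the power values used are
illustrative assumptions.

```c
/* Model of the patch's test: did rt tasks or freq scaling eat >10% of power? */
static int rt_freq_influence(unsigned int cpu_power, unsigned int cpu_power_orig)
{
	return cpu_power * 100 < cpu_power_orig * 90;
}

/* DIV_ROUND_CLOSEST(cpu_power, load_scale) with the SMT capacity fixup. */
static unsigned int group_capacity(unsigned int cpu_power,
				   unsigned int cpu_power_orig,
				   unsigned int load_scale)
{
	unsigned int cap = (cpu_power + load_scale / 2) / load_scale;

	if (cap == 0 && !rt_freq_influence(cpu_power, cpu_power_orig))
		cap = 1;	/* SMT thread: still allow one task */
	return cap;
}
```

So an SMT thread whose rounded power is 0 gets a capacity of 1 as long as rt
tasks or frequency scaling have not shaved more than 10% off its raw power;
once they have, capacity stays 0 and the thread is treated as unavailable.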
Re: ppc32: Weird process scheduling behaviour with 2.6.24-rc
On Fri, Jan 25, 2008 at 09:50:00AM +0100, Peter Zijlstra wrote:
>
> On Fri, 2008-01-25 at 18:25 +1100, Benjamin Herrenschmidt wrote:
> > On Fri, 2008-01-25 at 18:03 +1100, Benjamin Herrenschmidt wrote:
> > > On Fri, 2008-01-25 at 17:54 +1100, Benjamin Herrenschmidt wrote:
> > > >
> > > > Here, I do the test of running 4 times the repro-case provided by Michel
> > > > with nice 19 and a dd eating CPU with nice 0.
> > > >
> > > > Without this option, I get the dd at 100% and the nice 19 shells down
> > > > below it with whatever is left of the CPUs.
> > > >
> > > > With this option, dd gets about 50% of one CPU and the niced processes
> > > > still get most of the time.

Ben,

I presume you had CONFIG_FAIR_USER_SCHED turned on too? Also, were the dd
process and the niced processes running under different user ids? If so, that
is expected behavior: we divide CPU equally among users first and then among
the processes within each user.

> > > FYI. This is a 4 way G5 (ppc64)
> >
> > I also tested responsiveness of X running with or without that option
> > and with niced CPU eaters in the background (still 4 of them, one per
> > CPU), and I can confirm Michel's observations, it gets very sluggish
> > (maybe not -as- bad as his but still pretty annoying) with the fair
> > group scheduler enabled.

When CONFIG_FAIR_GROUP_SCHED (and CONFIG_FAIR_USER_SCHED) is not enabled, X
will be given higher priority for running on the CPU when compared to other
niced tasks. When the above options are turned on, X (running under the root
uid) would be given less priority to run when compared to niced tasks running
under different uids. Hence I expect some drop in interactivity experience
with FAIR_GROUP_SCHED on.

Can you pls let me know if any of these makes a difference:

1. Run the niced tasks as root. This would bring X and the niced tasks into
   the same "scheduler group" domain, which would give X much more CPU power
   when compared to the niced tasks.

2. Keep the niced tasks running under a non-root uid, but increase the root
   user's cpu share:

	# echo 8192 > /sys/kernel/uids/0/cpu_share

   This should bump up the root user's priority for running on the CPU and
   also give a better desktop experience.

> > Here, X is running with nice=0
>
> Curious, sounds like an issue with the group load balancer, vatsa, any
> ideas?

The group scheduler's SMP load balance in 2.6.24 is not the best it could be.
sched-devel has a better load balancer, which I am presuming will go into
2.6.25 soon.

In this case, I suspect that's not the issue. If X and the niced processes are
running under different uids, this (niced processes getting more cpu power) is
on expected lines. Will wait for Ben to confirm this.

--
Regards,
vatsa
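The two-level division described above can be made concrete with a tiny model.
The helper below is hypothetical and simplified: it assumes every user has an
equal cpu_share and ignores nice values, which only redistribute CPU *within*
a user's slice under FAIR_USER_SCHED.

```c
/*
 * Hypothetical model of FAIR_USER_SCHED's two-level split: the machine is
 * divided equally among users first, then each user's slice is divided
 * equally among that user's runnable tasks.
 */
static double task_cpu_percent(int nr_users, int tasks_of_this_user)
{
	return 100.0 / nr_users / tasks_of_this_user;
}
```

This is why one user's lone dd can be held to roughly half the CPU while
another user's four nice-19 hogs collectively keep the other half: the split
happens per user before nice values are even considered.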
Re: ppc32: Weird process scheduling behaviour with 2.6.24-rc
On Sat, Jan 26, 2008 at 03:13:54PM +1100, Benjamin Herrenschmidt wrote:
> > Ben,
> > I presume you had CONFIG_FAIR_USER_SCHED turned on too?
>
> Yes. It seems to be automatically turned on whenever FAIR_GROUP is
> turned on. Considering how bad the behaviour is for a standard desktop
> configuration, I'd be tempted to say to change it to default n.

If I recall, CONFIG_FAIR_USER_SCHED was turned on by default at the same time
as CONFIG_FAIR_GROUP_SCHED as a means to flush out fair-group scheduler bugs.
Also, at that time CONFIG_FAIR_CGROUP_SCHED was not available in mainline as
the second option for grouping tasks.

Going forward, I am in favor of turning off CONFIG_FAIR_USER_SCHED by default,
but turning on CONFIG_FAIR_GROUP_SCHED + CONFIG_FAIR_CGROUP_SCHED by default.
That way all tasks belong to the same group by default unless the admin
explicitly creates groups and moves tasks between them. This will be good for
the desktop user, who may choose to keep all tasks in one group by default,
while also giving him/her the flexibility of exploiting the fair-group
scheduler by creating custom task groups and adjusting their cpu shares (for
example: a kernel-compile group or a multi-media group).

If someone still needs the fair-user scheduler (as provided by
CONFIG_FAIR_USER_SCHED), they can still get it with CONFIG_FAIR_CGROUP_SCHED
by running a daemon [1] that dynamically moves tasks into different task
groups based on userid.

Ingo/Peter, what do you think?

> > Also were the
> > dd process and the niced processes running under different user ids? If
> > so, that is expected behavior, that we divide CPU equally among
> > users first and then among the processes within each user.
>
> They were different users and that behaviour seems to be a very stupid
> default behaviour for a desktop machine. Take this situation:
>
> - X running as root
> - User apps running as "user"
> - Background crap (indexing daemons etc...) running as their own user
>   or nobody
>
> Unless you can get some kind of grouping based on user sessions
> including suid binaries, X etc... I think this shouldn't default y in
> Kconfig.

Yes, see above.

> Not that it seems that Michel reported far worse behaviour than what I
> saw, including pretty hickup'ish X behaviour even without the fair group
> scheduler compared to 2.6.23. It might be because he's running X niced
> to -1 (I leave X at 0 and let the scheduler deal with it in general)
> though.

Hmm .. with X niced to -1, it should get more cpu power, leading to a better
desktop experience.

Michel, you had reported that commit 810e95ccd58d91369191aa4ecc9e6d4a10d8d0c8
was the cause of this bad behavior. Did you see the behavior change (from
good to bad) immediately after applying that patch during your bisect process?

> > 2. Keep the niced tasks running under a non-root uid, but increase root
> >    user's cpu share.
> >	# echo 8192 > /sys/kernel/uids/0/cpu_share
> >
> >    This should bump up root user's priority for running on CPU and also
> >    give a better desktop experience.
>
> Allright, that's something that might need to be set by default by the
> kernel ... as it will take some time to have knowledge of those knobs to
> percolate to distros. Too bad you can't do the opposite by default for
> "nobody" as there's no standard uid for it.
>
> > The group scheduler's SMP-load balance in 2.6.24 is not the best it
> > could be. sched-devel has a better load balancer, which I am presuming
> > will go into 2.6.25 soon.
> >
> > In this case, I suspect that's not the issue. If X and the niced
> > processes are running under different uids, this (niced processes
> > getting more cpu power) is on expected lines. Will wait for Ben to
> > confirm this.
>
> I would suggest turning the fair group scheduler to default n in stable
> for now.

I would prefer to have CONFIG_FAIR_GROUP_SCHED + CONFIG_FAIR_CGROUP_SCHED on
by default. Can you pls let me know how you think the desktop experience is
with that combination?

Reference:

1. http://article.gmane.org/gmane.linux.kernel/553267

--
Regards,
vatsa
Re: ppc32: Weird process scheduling behaviour with 2.6.24-rc
On Sat, Jan 26, 2008 at 04:15:52PM +1100, Benjamin Herrenschmidt wrote:
> > Michel,
> > You had reported that commit 810e95ccd58d91369191aa4ecc9e6d4a10d8d0c8
> > was the cause for this bad behavior. Do you see behavior change (from
> > good->bad) immediately after applying that patch during your bisect
> > process?
>
> Also Michel, double check your .config in both cases.

And also, Michel, please check whether CONFIG_FAIR_GROUP_SCHED +
CONFIG_FAIR_CGROUP_SCHED gives more or less the same desktop experience as
!CONFIG_FAIR_GROUP_SCHED!

> > I would prefer to have CONFIG_FAIR_GROUP_SCHED +
> > CONFIG_FAIR_CGROUP_SCHED on by default. Can you pls let me know how you
> > think is the desktop experience with that combination?
>
> I'm going to give that a try but unfortunately, it will have to wait
> until I'm back from LCA in a bit more than a week.

--
Regards,
vatsa
Re: ppc32: Weird process scheduling behaviour with 2.6.24-rc
On Mon, Jan 28, 2008 at 10:14:33AM +0100, Michel Dänzer wrote:
> > > * With CONFIG_FAIR_USER_SCHED enabled, X becomes basically
> > > unusable with a niced CPU hog, with or without top running. I
> > > don't know when this started, possibly when this option was
> > > first introduced.
> >
> > Srivatsa found an issue that might explain the very bad behaviour under
> > group scheduling. But I gather you're not at all interested in this
> > feature?
>
> That's right, but it's good to hear you have a lead there as well, and
> if you can't find any interested testers, let me know and I'll try.

Michel,

Thanks for offering to test! The issue I found wrt preemption latency (when
FAIR_USER_SCHED is turned on) is explained here:

	http://marc.info/?l=linux-kernel&m=120148675326287

Does the patch in that URL help bring FAIR_USER_SCHED interactivity to the
same level as !FAIR_USER_SCHED?

--
Regards,
vatsa
Re: ppc32: Weird process scheduling behaviour with 2.6.24-rc
On Mon, Jan 28, 2008 at 01:32:53PM +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[EMAIL PROTECTED]> wrote:
>
> > > * With CONFIG_FAIR_USER_SCHED disabled, there are severe
> > > interactivity hickups with a niced CPU hog and top running. This
> > > started with commit 810e95ccd58d91369191aa4ecc9e6d4a10d8d0c8.
> >
> > The revert at the bottom causes the wakeup granularity to shrink for
> > +nice and to grow for -nice. That is, it becomes easier to preempt a
> > +nice task, and harder to preempt a -nice task.
>
> i think it would be OK to do half of this: make it easier to preempt a
> +nice task.

Hmm .. I doubt whether that would help Michel's case, as he seems to be
running +niced tasks and having problems getting control over his desktop.
Something is basically wrong here ..

> Michel, do you really need the -nice portion as well? It's
> not a problem to super-preempt positively reniced tasks, but it can be
> quite annoying if negatively reniced tasks have super-slices.

--
Regards,
vatsa