[PATCH 0/2] cpuhotplug/nohz: Fix issue of "negative" idle time

2013-01-03 Thread Srivatsa Vaddagiri
On most architectures (arm, mips, s390, sh and x86), the idle thread of a cpu does
not cleanly exit nohz state before dying upon hot-remove. As a result, an
offline cpu is seen to be in nohz mode (ts->idle_active = 1) and its offline
time can potentially be included in the total idle time reported via /proc/stat.
When the same cpu later comes online, however, its offline time is not included
in its idle time statistics, causing applications like top to observe a
rollback in total idle time.

Example output from Android top command highlighting this issue is below:

User 232%, System 70%, IOW 46%, IRQ 1%
User 1322 + Nice 0 + Sys 399 + Idle -1423 + IOW 264 + IRQ 0 + SIRQ 7 = 569

top is reporting the system to have been idle for -1423 ticks over the sampling
period. This happens because the total idle time reported in the cpu line of
/proc/stat *dropped* from the last value observed (cached) by the top command.

While this was originally seen on an ARM platform running a 3.4-based kernel, I
could easily recreate it on my x86 desktop running the latest tip/master kernel
(HEAD 3a7bfcad). Online/offline a cpu in a tight loop and, in another loop, read
/proc/stat and observe whether the total idle time drops from the previously
read value.

Although commit 7386cdbf (nohz: Fix idle ticks in cpu summary line of
/proc/stat) aims to avoid this bug, it's not preemption-proof. A
thread could get preempted after the cpu_online() check in get_idle_time(), thus
potentially leading to get_cpu_idle_time_us() being invoked on an offline cpu.

One potential fix is to serialize hotplug with the /proc/stat read operation
(via get/put_online_cpus()), which I disliked in favor of the solution proposed
in this series.

In this patch series:

- Patch 1/2 modifies the idle loop on the arm, mips, s390, sh and x86
  architectures to exit nohz state before the associated idle thread dies upon
  hot-remove. This fixes the idle time accounting bug.

  Patch 1/2 also modifies the idle loop on all architectures supporting cpu
  hotplug to have the idle thread of a dying cpu die immediately after
  schedule() returns control to it. I see no point in wasting time on calls to
  *_enter()/*_exit() before noticing the need to die and dying.

- Patch 2/2 reverts commit 7386cdbf (nohz: Fix idle ticks in cpu summary line of
  /proc/stat). The cpu_online() check introduced by it is no longer necessary
  with Patch 1/2 applied. Having fewer code sites worry about online status of
  cpus is a good thing!

---

 arch/arm/kernel/process.c  |9 -
 arch/arm/kernel/smp.c  |2 +-
 arch/blackfin/kernel/process.c |8 
 arch/mips/kernel/process.c |6 +++---
 arch/powerpc/kernel/idle.c |2 +-
 arch/s390/kernel/process.c |4 ++--
 arch/sh/kernel/idle.c  |5 ++---
 arch/sparc/kernel/process_64.c |3 ++-
 arch/x86/kernel/process.c  |5 ++---
 fs/proc/stat.c |   14 --
 10 files changed, 25 insertions(+), 33 deletions(-)
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 1/2] cpuhotplug/nohz: Remove offline cpus from nohz-idle state

2013-01-03 Thread Srivatsa Vaddagiri
Modify the idle loop of the arm, mips, s390, sh and x86 architectures to exit
from nohz state before dying upon hot-remove. This change is needed to prevent
userspace tools like top from seeing a rollback in total idle time over some
sampling periods.

Additionally, modify the idle loop on all architectures supporting cpu hotplug
to have the idle thread of a dying cpu die immediately after schedule() returns
control to it. There is no point in wasting time on calls to *_enter()/*_exit()
before noticing the need to die and dying.
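In kernel-style pseudocode, the reordering amounts to moving the offline check
from inside the nohz/rcu-idle section to just after schedule_preempt_disabled()
returns (a condensed sketch of the change this patch makes per-architecture):

```c
/* before: dies inside the nohz section, with ts->idle_active still 1 */
while (1) {
        tick_nohz_idle_enter();
        rcu_idle_enter();
        while (!need_resched()) {
                if (cpu_is_offline(smp_processor_id()))
                        cpu_die();
                /* ... */
        }
        rcu_idle_exit();
        tick_nohz_idle_exit();
        schedule_preempt_disabled();
}

/* after: nohz state is cleanly exited first, then die immediately */
while (1) {
        tick_nohz_idle_enter();
        rcu_idle_enter();
        while (!need_resched()) {
                /* ... */
        }
        rcu_idle_exit();
        tick_nohz_idle_exit();
        schedule_preempt_disabled();
        if (cpu_is_offline(smp_processor_id()))
                cpu_die();
}
```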

Additional ARM specific change:
Revert commit ff081e05 ("ARM: 7457/1: smp: Fix suspicious
RCU originating from cpu_die()"), which added an RCU_NONIDLE() wrapper
around the call to complete(). That wrapper is no longer needed, as cpu_die() is
now called outside of a rcu_idle_enter()/exit() section. I also think that the
wait_for_completion()-based wait in ARM's __cpu_die() can be replaced with a
busy-loop-based one, as the wait there should in general terminate within a
few cycles.

Cc: Russell King 
Cc: Paul E. McKenney 
Cc: Stephen Boyd 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Mike Frysinger 
Cc: uclinux-dist-de...@blackfin.uclinux.org
Cc: Ralf Baechle 
Cc: linux-m...@linux-mips.org
Cc: Benjamin Herrenschmidt 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Martin Schwidefsky 
Cc: linux-s...@vger.kernel.org
Cc: Paul Mundt 
Cc: linux...@vger.kernel.org
Cc: "David S. Miller" 
Cc: sparcli...@vger.kernel.org
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: mho...@suse.cz
Cc: srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Srivatsa Vaddagiri 
---
 arch/arm/kernel/process.c  |9 -
 arch/arm/kernel/smp.c  |2 +-
 arch/blackfin/kernel/process.c |8 
 arch/mips/kernel/process.c |6 +++---
 arch/powerpc/kernel/idle.c |2 +-
 arch/s390/kernel/process.c |4 ++--
 arch/sh/kernel/idle.c  |5 ++---
 arch/sparc/kernel/process_64.c |3 ++-
 arch/x86/kernel/process.c  |5 ++---
 9 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index c6dec5f..254099b 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -191,11 +191,6 @@ void cpu_idle(void)
rcu_idle_enter();
ledtrig_cpu(CPU_LED_IDLE_START);
while (!need_resched()) {
-#ifdef CONFIG_HOTPLUG_CPU
-   if (cpu_is_offline(smp_processor_id()))
-   cpu_die();
-#endif
-
/*
 * We need to disable interrupts here
 * to ensure we don't miss a wakeup call.
@@ -224,6 +219,10 @@ void cpu_idle(void)
rcu_idle_exit();
tick_nohz_idle_exit();
schedule_preempt_disabled();
+#ifdef CONFIG_HOTPLUG_CPU
+   if (cpu_is_offline(smp_processor_id()))
+   cpu_die();
+#endif
}
 }
 
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 84f4cbf..a8e3b8a 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -251,7 +251,7 @@ void __ref cpu_die(void)
mb();
 
/* Tell __cpu_die() that this CPU is now safe to dispose of */
-   RCU_NONIDLE(complete(&cpu_died));
+   complete(&cpu_died);
 
/*
 * actual CPU shutdown procedure is at least platform (if not
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 3e16ad9..2bee1af 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -83,10 +83,6 @@ void cpu_idle(void)
while (1) {
void (*idle)(void) = pm_idle;
 
-#ifdef CONFIG_HOTPLUG_CPU
-   if (cpu_is_offline(smp_processor_id()))
-   cpu_die();
-#endif
if (!idle)
idle = default_idle;
tick_nohz_idle_enter();
@@ -98,6 +94,10 @@ void cpu_idle(void)
preempt_enable_no_resched();
schedule();
preempt_disable();
+#ifdef CONFIG_HOTPLUG_CPU
+   if (cpu_is_offline(smp_processor_id()))
+   cpu_die();
+#endif
}
 }
 
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index a11c6f9..41102a0 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -71,13 +71,13 @@ void __noreturn cpu_idle(void)
start_critical_timings();
}
}
+   rcu_idle_exit();
+   tick_nohz_idle_exit();
+   schedule_preempt_disabled();
 #ifdef CONFIG_HOTPLUG_CPU
if (!cpu_online(cpu) && !cpu_isset(cpu, cpu_callin_map))
play_dead();
 #endif
-   rcu_idle_exit();
-   tick_nohz_idle_exit();
-   schedule_preempt_disabled();

[PATCH 2/2] Revert "nohz: Fix idle ticks in cpu summary line of /proc/stat" (commit 7386cdbf2f57ea8cff3c9fde93f206e58b9fe13f).

2013-01-03 Thread Srivatsa Vaddagiri
With offline cpus no longer being seen in nohz mode (ts->idle_active=0), we
don't need the cpu_online() check introduced in commit 7386cdbf. An offline
cpu's idle time, as last recorded in its ts->idle_sleeptime, will be reported
(thus excluding its offline time from the idle time statistics).

Cc: mho...@suse.cz
Cc: srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Srivatsa Vaddagiri 
---
 fs/proc/stat.c |   14 --
 1 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..64c3b31 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -45,13 +45,10 @@ static cputime64_t get_iowait_time(int cpu)
 
 static u64 get_idle_time(int cpu)
 {
-   u64 idle, idle_time = -1ULL;
-
-   if (cpu_online(cpu))
-   idle_time = get_cpu_idle_time_us(cpu, NULL);
+   u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
 
if (idle_time == -1ULL)
-   /* !NO_HZ or cpu offline so we can rely on cpustat.idle */
+   /* !NO_HZ so we can rely on cpustat.idle */
idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
else
idle = usecs_to_cputime64(idle_time);
@@ -61,13 +58,10 @@ static u64 get_idle_time(int cpu)
 
 static u64 get_iowait_time(int cpu)
 {
-   u64 iowait, iowait_time = -1ULL;
-
-   if (cpu_online(cpu))
-   iowait_time = get_cpu_iowait_time_us(cpu, NULL);
+   u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
 
if (iowait_time == -1ULL)
-   /* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
+   /* !NO_HZ so we can rely on cpustat.iowait */
iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
else
iowait = usecs_to_cputime64(iowait_time);


Re: [PATCH 2/2] Revert "nohz: Fix idle ticks in cpu summary line of /proc/stat" (commit 7386cdbf2f57ea8cff3c9fde93f206e58b9fe13f).

2013-01-04 Thread Srivatsa Vaddagiri
* Sergei Shtylyov  [2013-01-04 16:13:42]:

> >With offline cpus no longer beeing seen in nohz mode (ts->idle_active=0), we
> >don't need the check for cpu_online() introduced in commit 7386cdbf. Offline
> 
>Please also specify the summary of that commit in parens (or
> however you like).

I had that in the Subject line, but yes, it would be good to include it in the
commit message as well. I will incorporate that change along with anything else
required in the next version of this patch.

- vatsa


Re: [PATCH 1/2] cpuhotplug/nohz: Remove offline cpus from nohz-idle state

2013-01-07 Thread Srivatsa Vaddagiri
* Russell King - ARM Linux  [2013-01-05 10:36:27]:

> On Thu, Jan 03, 2013 at 06:58:38PM -0800, Srivatsa Vaddagiri wrote:
> > I also think that the
> > wait_for_completion() based wait in ARM's __cpu_die() can be replaced with a
> > busy-loop based one, as the wait there in general should be terminated 
> > within
> > few cycles.
> 
> Why open-code this stuff when we have infrastructure already in the kernel
> for waiting for stuff to happen?  I chose to use the standard infrastructure
> because its better tested, and avoids having to think about whether we need
> CPU barriers and such like to ensure that updates are seen in a timely
> manner.

I was primarily thinking of calling as few generic functions as possible on
a dead cpu. I recall several "am I running on a dead cpu?" checks
(cpu_is_offline(this_cpu)) that were put into generic routines during early
versions of cpu hotplug [1] to educate code running on a dead cpu; the need for
them went away with the introduction of the atomic/stop-machine variant. The
need to add an RCU_NONIDLE() wrapper around ARM's cpu_die() [2] is perhaps a
more recent example of educating code running on a dead cpu. The more quickly
we die after the idle thread of the dying cpu gains control, the better!

1. http://lwn.net/Articles/69040/
2. http://lists.infradead.org/pipermail/linux-arm-kernel/2012-July/107971.html

- vatsa


Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

2010-06-07 Thread Srivatsa Vaddagiri
On Mon, May 31, 2010 at 10:33:16AM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote:
> > 
> > 
> > Hrmm, my brain seems muddled but I might have another solution, let me
> > ponder this for a bit..
> > 
> 
> Right, so the thing I was thinking about is taking the group capacity
> into account when determining the capacity for a single cpu.

Peter,
We are exploring an alternate solution which seems to be working as
expected. Basically, allow a capacity of 1 for SMT threads provided there is
no significant influence by RT tasks or frequency scaling. Note that at the
core level, capacity is unchanged and hence this affects only how tasks are
distributed within a core.

Mike Neuling should post an updated patchset containing this patch
(with more comments added, of course!).


Signed-off-by: Srivatsa Vaddagiri 

---
 include/linux/sched.h |2 +-
 kernel/sched_fair.c   |   30 +++---
 2 files changed, 24 insertions(+), 8 deletions(-)

Index: linux-2.6-ozlabs/include/linux/sched.h
===
--- linux-2.6-ozlabs.orig/include/linux/sched.h
+++ linux-2.6-ozlabs/include/linux/sched.h
@@ -860,7 +860,7 @@ struct sched_group {
 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
 * single CPU.
 */
-   unsigned int cpu_power;
+   unsigned int cpu_power, cpu_power_orig;
 
/*
 * The CPUs this group covers.
Index: linux-2.6-ozlabs/kernel/sched_fair.c
===
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -2285,13 +2285,6 @@ static void update_cpu_power(struct sche
unsigned long power = SCHED_LOAD_SCALE;
struct sched_group *sdg = sd->groups;
 
-   if (sched_feat(ARCH_POWER))
-   power *= arch_scale_freq_power(sd, cpu);
-   else
-   power *= default_scale_freq_power(sd, cpu);
-
-   power >>= SCHED_LOAD_SHIFT;
-
if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
if (sched_feat(ARCH_POWER))
power *= arch_scale_smt_power(sd, cpu);
@@ -2301,6 +2294,15 @@ static void update_cpu_power(struct sche
power >>= SCHED_LOAD_SHIFT;
}
 
+   sdg->cpu_power_orig = power;
+
+   if (sched_feat(ARCH_POWER))
+   power *= arch_scale_freq_power(sd, cpu);
+   else
+   power *= default_scale_freq_power(sd, cpu);
+
+   power >>= SCHED_LOAD_SHIFT;
+
power *= scale_rt_power(cpu);
power >>= SCHED_LOAD_SHIFT;
 
@@ -2333,6 +2335,22 @@ static void update_group_power(struct sc
sdg->cpu_power = power;
 }
 
+static inline int
+rt_freq_influence(struct sched_group *group, struct sched_domain *sd)
+{
+   if (sd->child)
+   return 1;
+
+   /*
+* Check to see if the final cpu power was reduced by more
+* than 10% by frequency or rt tasks
+*/
+   if (group->cpu_power * 100 < group->cpu_power_orig * 90)
+   return 1;
+
+   return 0;
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @sd: The sched_domain whose statistics are to be updated.
@@ -2426,6 +2444,8 @@ static inline void update_sg_lb_stats(st
 
sgs->group_capacity =
DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+   if (!sgs->group_capacity && !rt_freq_influence(group, sd))
+   sgs->group_capacity = 1;
 }
 
 /**
@@ -2725,7 +2745,8 @@ ret:
  */
 static struct rq *
 find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
-  unsigned long imbalance, const struct cpumask *cpus)
+  unsigned long imbalance, const struct cpumask *cpus,
+  struct sched_domain *sd)
 {
struct rq *busiest = NULL, *rq;
unsigned long max_load = 0;
@@ -2736,6 +2757,9 @@ find_busiest_queue(struct sched_group *g
unsigned long capacity = DIV_ROUND_CLOSEST(power, 
SCHED_LOAD_SCALE);
unsigned long wl;
 
+   if (!capacity && !rt_freq_influence(group, sd))
+   capacity = 1;
+
if (!cpumask_test_cpu(i, cpus))
continue;
 
@@ -2852,7 +2876,7 @@ redo:
goto out_balanced;
}
 
-   busiest = find_busiest_queue(group, idle, imbalance, cpus);
+   busiest = find_busiest_queue(group, idle, imbalance, cpus, sd);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;




Re: ppc32: Weird process scheduling behaviour with 2.6.24-rc

2008-01-26 Thread Srivatsa Vaddagiri
On Fri, Jan 25, 2008 at 09:50:00AM +0100, Peter Zijlstra wrote:
> 
> On Fri, 2008-01-25 at 18:25 +1100, Benjamin Herrenschmidt wrote:
> > On Fri, 2008-01-25 at 18:03 +1100, Benjamin Herrenschmidt wrote:
> > > On Fri, 2008-01-25 at 17:54 +1100, Benjamin Herrenschmidt wrote:
> > > > 
> > > > Here, I do the test of running 4 times the repro-case provided by Michel
> > > > with nice 19 and a dd eating CPU with nice 0.
> > > > 
> > > > Without this option, I get the dd at 100% and the nice 19 shells down
> > > > below it with whatever is left of the CPUs.
> > > > 
> > > > With this option, dd gets about 50% of one CPU and the niced processes
> > > > still get most of the time.

Ben,
I presume you had CONFIG_FAIR_USER_SCHED turned on too? Also, were the
dd process and the niced processes running under different user ids? If
so, that is expected behavior: we divide CPU equally among
users first and then among the processes within each user.

> > > FYI. This is a 4 way G5 (ppc64)
> > 
> > I also tested responsiveness of X running with or without that option
> > and with niced CPU eaters in the background (still 4 of them, one per
> > CPU), and I can confirm Michel observations, it gets very sluggish
> > (maybe not -as- bad as his but still pretty annoying) with the fair
> > group scheduler enabled.

When CONFIG_FAIR_GROUP_SCHED (and CONFIG_FAIR_USER_SCHED) is not
enabled, X will be given higher priority to run on the CPU compared to
other niced tasks. When the above options are turned on, X (running
under the root uid) is given lower priority to run compared to other
niced tasks running under different uids. Hence I expect some drop in
interactivity-experience with FAIR_GROUP_SCHED on.

Can you pls let me know if any of these makes a difference:

1. Run the niced tasks as root. This would bring X and the niced tasks into the
same "scheduler group" domain, which would give X much more CPU power
compared to the niced tasks.

2. Keep the niced tasks running under a non-root uid, but increase the root
   user's cpu share:
# echo 8192 > /sys/kernel/uids/0/cpu_share

   This should bump up the root user's priority for running on the CPU and also
   give a better desktop experience.

> > Here, X is running with nice=0
> 
> Curious, sounds like an issue with the group load balancer, vatsa, any
> ideas?

The group scheduler's SMP load balancing in 2.6.24 is not the best it
could be. sched-devel has a better load balancer, which I presume
will go into 2.6.25 soon.

In this case, I suspect that's not the issue. If X and the niced processes are
running under different uids, this (niced processes getting more cpu power) is
along expected lines. Will wait for Ben to confirm this.

-- 
Regards,
vatsa


Re: ppc32: Weird process scheduling behaviour with 2.6.24-rc

2008-01-26 Thread Srivatsa Vaddagiri
On Sat, Jan 26, 2008 at 03:13:54PM +1100, Benjamin Herrenschmidt wrote:
> > Ben,
> > I presume you had CONFIG_FAIR_USER_SCHED turned on too?
> 
> Yes. It seems to be automatically turned on whenever FAIR_GROUP is
> turned on. Considering how bad the behaviour is for a standard desktop
> configuration, I'd be tempted to say to change it to default n.

If I recall, CONFIG_FAIR_USER_SCHED was turned on by default at the same
time as CONFIG_FAIR_GROUP_SCHED, as a means to flush out fair-group
scheduler bugs. Also, at that time CONFIG_FAIR_CGROUP_SCHED was not
available in mainline as a second option for grouping tasks.

Going forward, I am in favor of turning off CONFIG_FAIR_USER_SCHED by default,
but turning on CONFIG_FAIR_GROUP_SCHED + CONFIG_FAIR_CGROUP_SCHED by default.

That way all tasks belong to the same group by default unless the admin
explicitly creates groups and moves tasks between them. This will be good for
the desktop user, who may choose to keep all tasks in one group by default,
while also giving him/her the flexibility of exploiting the fair-group
scheduler by creating custom task groups and adjusting their cpu shares (for
example, a kernel-compile group or a multi-media group). If someone still needs
the fair-user scheduler (as provided by CONFIG_FAIR_USER_SCHED), they can still
get it with CONFIG_FAIR_CGROUP_SCHED by running a daemon [1] that dynamically
moves tasks into different task groups based on userid.

Ingo/Peter, what do you think?

> > Also were the
> > dd process and the niced processes running under different user ids? If
> > so, that is expected behavior, that we divide CPU equally among
> > users first and then among the processes within each user.
> 
> They were different users and that behaviour seems to be a very stupid
> default behaviour for a desktop machine. Take this situation:
> 
>  - X running as root
>  - User apps running as "user"
>  - Background crap (indexing daemons etc...) running as their own user
> or nobody
> 
> Unless you can get some kind of grouping based on user sessions
> including suid binaries, X etc... I think this shouldn't default y in
> Kconfig.

yes, see above.

> Not that it seems that Michel reported far worse behaviour than what I
> saw, including pretty hickup'ish X behaviour even without the fair group
> scheduler compared to 2.6.23. It might be because he's running X niced
> to -1 (I leave X at 0 and let the scheduler deal with it in general)
> though.

Hmm, with X niced to -1, it should get more cpu power, leading to a
better desktop experience.

Michel,
You had reported that commit 810e95ccd58d91369191aa4ecc9e6d4a10d8d0c8
was the cause of this bad behavior. Did you see the behavior change (from good
to bad) immediately after applying that patch during your bisect process?

> > 2. Keep the niced tasks running under a non-root uid, but increase root 
> > users 
> >cpu share.
> > # echo 8192 > /sys/kernel/uids/0/cpu_share
> > 
> >This should bump up root user's priority for running on CPU and also 
> >give a better desktop experience.
> 
> Allright, that's something that might need to be set by default by the
> kernel ... as it will take some time to have knowledge of those knobs to
> percolate to distros. Too bad you can't do the opposite by default for
> "nobody" as there's no standard uid for it.
> 
> > The group scheduler's SMP-load balance in 2.6.24 is not the best it
> > could be. sched-devel has a better load balancer, which I am presuming
> > will go into 2.6.25 soon.
> > 
> > In this case, I suspect that's not the issue.  If X and the niced processes 
> > are 
> > running under different uids, this (niced processes getting more cpu power) 
> > is 
> > on expected lines. Will wait for Ben to confirm this. 
> 
> I would suggest turning the fair group scheduler to default n in stable
> for now.

I would prefer to have CONFIG_FAIR_GROUP_SCHED +
CONFIG_FAIR_CGROUP_SCHED on by default. Can you please let me know how the
desktop experience is with that combination?

Reference:

1. http://article.gmane.org/gmane.linux.kernel/553267


-- 
Regards,
vatsa



Re: ppc32: Weird process scheduling behaviour with 2.6.24-rc

2008-01-26 Thread Srivatsa Vaddagiri
On Sat, Jan 26, 2008 at 04:15:52PM +1100, Benjamin Herrenschmidt wrote:
> > Michel,
> > You had reported that commit 810e95ccd58d91369191aa4ecc9e6d4a10d8d0c8 
> > was the cause for this bad behavior. Do you see behavior change (from 
> > good->bad)
> > immediately after applying that patch during your bisect process?
> 
> Also Michel, double check your .config in both cases.

And also, Michel, please check whether CONFIG_FAIR_GROUP_SCHED +
CONFIG_FAIR_CGROUP_SCHED gives more or less the same desktop experience as
!CONFIG_FAIR_GROUP_SCHED!

> > I would prefer to have CONFIG_FAIR_GROUP_SCHED +
> > CONFIG_FAIR_CGROUP_SCHED on by default. Can you pls let me know how you
> > think is the desktop experience with that combination?
> 
> I'm going to give that a try but unfortunately, it will have to wait
> until I'm back from LCA in a bit more than a week.

-- 
Regards,
vatsa


Re: ppc32: Weird process scheduling behaviour with 2.6.24-rc

2008-01-28 Thread Srivatsa Vaddagiri
On Mon, Jan 28, 2008 at 10:14:33AM +0100, Michel Dänzer wrote:
> > >   * With CONFIG_FAIR_USER_SCHED enabled, X becomes basically
> > > unusable with a niced CPU hog, with or without top running. I
> > > don't know when this started, possibly when this option was
> > > first introduced.
> > 
> > Srivatsa found an issue that might explain the very bad behaviour under
> > group scheduling. But I gather you're not at all interested in this
> > feature?
> 
> That's right, but it's good to hear you have a lead there as well, and
> if you can't find any interested testers, let me know and I'll try.

Michel,
Thanks for offering to test! The issue I found wrt preemption latency
(when FAIR_USER_SCHED is turned on) is explained here:

http://marc.info/?l=linux-kernel&m=120148675326287

Does the patch at that URL help bring FAIR_USER_SCHED interactivity to the same
level as !FAIR_USER_SCHED?

-- 
Regards,
vatsa


Re: ppc32: Weird process scheduling behaviour with 2.6.24-rc

2008-01-28 Thread Srivatsa Vaddagiri
On Mon, Jan 28, 2008 at 01:32:53PM +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > >   * With CONFIG_FAIR_USER_SCHED disabled, there are severe
> > > interactivity hickups with a niced CPU hog and top running. This
> > > started with commit 810e95ccd58d91369191aa4ecc9e6d4a10d8d0c8. 
> > 
> > The revert at the bottom causes the wakeup granularity to shrink for + 
> > nice and to grow for - nice. That is, it becomes easier to preempt a + 
> > nice task, and harder to preempt a - nice task.
> 
> i think it would be OK to do half of this: make it easier to preempt a 
> +nice task.

Hmm, I doubt that would help Michel's case, as he seems to be running
+niced tasks and having problems getting control of his desktop.

Something is basically wrong here ..

> Michel, do you really need the -nice portion as well? It's 
> not a problem to super-preempt positively reniced tasks, but it can be 
> quite annoying if negatively reniced tasks have super-slices.

-- 
Regards,
vatsa