On 1 February 2013 19:03, Frederic Weisbecker <fweis...@gmail.com> wrote:
> 2013/1/29 Vincent Guittot <vincent.guit...@linaro.org>:
>> On my smp platform which is made of 5 cores in 2 clusters,I have the
>> nr_busy_cpu field of sched_group_power struct that is not null when the
>> platform is fully idle. The root cause seems to be:
>> During the boot sequence, some CPUs reach the idle loop and set their
>> NOHZ_IDLE flag while waiting for others CPUs to boot. But the nr_busy_cpus
>> field is initialized later with the assumption that all CPUs are in the busy
>> state whereas some CPUs have already set their NOHZ_IDLE flag.
>> We clear the NOHZ_IDLE flag when nr_busy_cpus is initialized in order to
>> have a coherent configuration.
>>
>> Signed-off-by: Vincent Guittot <vincent.guit...@linaro.org>
>> ---
>>  kernel/sched/core.c |    1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 257002c..fd41924 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5884,6 +5884,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
>>
>>         update_group_power(sd, cpu);
>>         atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
>> +       clear_bit(NOHZ_IDLE, nohz_flags(cpu));
>
> So that's a real issue indeed. nr_busy_cpus was never correct.
>
> Now I'm still a bit worried with this solution. What if an idle task
> started in smp_init() has not yet stopped its tick, but is about to do
> so? The domains are not yet available to the task but the nohz flags
> are. When it later restarts the tick, it's going to erroneously
> increase nr_busy_cpus.
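
To make the boot-time mismatch from the changelog concrete, here is a small
user-space sketch in C. It is not kernel code and only an illustration under
stated assumptions: all the model_* names are hypothetical, a plain int stands
in for the atomic nr_busy_cpus counter, a single group covers all 5 CPUs, and
the flag is cleared for every CPU in one place, whereas the patch clears it
per CPU in init_sched_groups_power(). It does not cover the later race with a
CPU that is about to stop its tick.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 5

/* Hypothetical stand-in for sched_group_power: only the busy counter. */
struct model_group {
    int nr_busy_cpus;
};

static struct model_group model_all;                    /* one group, all CPUs */
static bool model_nohz_idle[MODEL_NR_CPUS];             /* models NOHZ_IDLE */
static struct model_group *model_domain[MODEL_NR_CPUS]; /* NULL: no domain yet */

/* Mark a CPU idle; the counter only changes if a domain is attached. */
void model_set_idle(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    if (model_nohz_idle[cpu])
        return;                 /* flag already set: nothing is accounted */
    model_nohz_idle[cpu] = true;
    if (model_domain[cpu])
        model_domain[cpu]->nr_busy_cpus--;
}

/* Mark a CPU busy again; mirror image of model_set_idle(). */
void model_set_busy(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    if (!model_nohz_idle[cpu])
        return;
    model_nohz_idle[cpu] = false;
    if (model_domain[cpu])
        model_domain[cpu]->nr_busy_cpus++;
}

/* Initialise the counter assuming every CPU is busy, then attach the group. */
void model_init_group(struct model_group *g, int weight, bool clear_flags)
{
    int cpu;

    g->nr_busy_cpus = weight;
    for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        if (clear_flags)
            model_nohz_idle[cpu] = false;   /* what the patch adds */
        model_domain[cpu] = g;
    }
}

/* Replay the boot sequence; return nr_busy_cpus once every CPU is idle. */
int model_boot(bool clear_flags)
{
    int cpu;

    for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        model_nohz_idle[cpu] = false;
        model_domain[cpu] = NULL;
    }

    /* Secondary CPUs reach the idle loop before any domain exists. */
    for (cpu = 1; cpu < MODEL_NR_CPUS; cpu++)
        model_set_idle(cpu);

    model_init_group(&model_all, MODEL_NR_CPUS, clear_flags);

    /* Later on, the whole platform goes idle. */
    for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        model_set_idle(cpu);

    return model_all.nr_busy_cpus;
}
```

In this model, model_boot(false) returns 4 although every CPU is idle, because
CPUs 1 to 4 already carry the flag and are never subtracted from the counter.
model_boot(true) returns 0, since clearing the flag at initialisation lets each
CPU account for its own transition to idle.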

My first idea was to clear the NOHZ_IDLE flag and nr_busy_cpus in
init_sched_groups_power instead of setting them as is done now. If a CPU
enters idle during the init sequence, the flag is already cleared, and
nohz_flags and nr_busy_cpus will stay synced and cleared while a NULL
sched_domain is attached to the CPU, thanks to patch 2.
This should solve all use cases?

>
> It probably won't happen in practice. But then there is more: sched
> domains can be concurrently rebuilt anytime, right? So what if we
> call set_cpu_sd_state_idle() and decrease nr_busy_cpus while the
> domain is switched concurrently. Are we having a new sched group along
> the way? If so we have a bug here as well because we can have
> NOHZ_IDLE set but nr_busy_cpus accounting the CPU.

When the sched_domains are rebuilt, we set a NULL sched_domain during the
rebuild sequence, and a new sched_group_power is created as well.

>
> Maybe we need to set the per cpu nohz flags on the child leaf sched
> domain? This way it's initialized and stored on the same RCU pointer
> and nohz_flags and nr_busy_cpus stay in sync.
>
> Also we probably still need the first patch of your previous round.
> Because the current patch may introduce situations where we have idle
> CPUs with NOHZ_IDLE flags cleared.

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev