On Mon, Oct 12, 2015 at 07:34:02PM +0200, Oleg Nesterov wrote:
> On 10/12, Peter Zijlstra wrote:
> >
> > On Sat, Oct 10, 2015 at 08:53:09PM +0200, Oleg Nesterov wrote:
> > > I do not understand the cpu_active() check in select_fallback_rq().
> > > x86 doesn't need it, and the recent commit dd9d3843755d "sched: Fix
> > > cpu_active_mask/cpu_online_mask race" documents the fact that on any
> > > architecture we can ignore !active starting from CPU_ONLINE stage.
> > >
> > > But any possible reason why do we need this check in "fallback" must
> > > equally apply to select_task_rq().
> >
> > So the reason, from vague memory, is that we want to allow per-cpu
> > threads to start/stop before/after active.
>
> I simply can't understand... To me it looks as if we can simply remove
> the cpu_active() check in select_fallback_rq().
>
> If we race with cpu_down(), cpu_active() is cleared by sched_cpu_inactive()
> which is CPU_PRI_SCHED_INACTIVE = INT_MIN + 1 priority, so it seems that
> only cpuset_cpu_inactive() can be called after that and this check is
> obviously racy anyway.
Racy yes; however select_task_rq() is called with locks held, therefore
preemption disabled. This serializes us against the sync_sched() in
cpu_down(). And note that smpboot_park_threads() doesn't run until after
the sync_sched().

So you cannot add cpu_active() to select_task_rq(), for it must allow
per-cpu tasks to remain on the cpu.

Also, I think smpboot_park_threads() is called too early; we can still
run random tasks at that point which might rely on having the per-cpu
tasks around. But we cannot run it later, because once the stopper
thread runs, the per-cpu threads cannot schedule to park themselves :/

Let's keep an eye on Thomas' rework to see if this gets sorted.

> So why we can't simply remove select_fallback_rq()->cpu_active() ?

It would allow migration onto a CPU that's known to be going away. That
doesn't make sense.

> > active 'should' really only govern load-balancer bits or so.
>
> OK, I don't understand the usage of cpu_active_mask in kernel/sched/,
> and of course I could easily miss something else. But I doubt very
> much this check can save us from something bad simply because it is
> racy.

But safely so; if we observe active, it must stay 'active' until we
enable preemption again.

The whole point of active is to limit the load-balancer, such that we
can rebuild the sched-domains after the fact. We used to have to
rebuild the sched-domains early, which got us into trouble (forgot
details, again).

And we had to have a new mask, because online was too late; there was
(forgot details) a state where we were still online but new tasks were
not welcome.

Also, it makes sense to stop accepting new tasks before you take out
the old ones.

> Yes, we also have synchronize_sched() after CPU_DOWN_PREPARE, but
> the only thing we do before stop_machine() is smpboot_park_threads().
> So this can help (say) set_cpus_allowed_ptr() which uses active_mask,
> but I don't see how this can connect to ttwu paths.

Say you have a task A running on CPU4, and it goes to sleep. We unplug
CPU4. We then commence unplugging CPU3; concurrently we wake A. A finds
that its old CPU (4) is no longer online and ends up in fallback,
looking for a new one.

Without the cpu_active() test in fallback, we could place A on CPU3,
which is on its way down, just not quite dead. Even if this is not
strictly buggy, it's a daft thing to do and is tempting fate.

> And again. If there is some reason we can not do this, say, because
> ipi to non-active CPU can trigger a warning, or something else, then
> we can hit the same problem because select_task_rq() does not check
> cpu_active(). The kernel threads like stoppers/workers are probably
> fine in any case, but userspace can trigger the race with cpu_up/down.

So the only 'race' is observing active while we're stuck in
sync_sched(), which is meant to be safe. It also provides us with a
known spot after which we're sure no new tasks will come onto our
dying CPU.
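To make the above concrete, a minimal sketch of the shape of the check
under discussion. This is not the mainline code: the real
select_fallback_rq() in kernel/sched/core.c prefers node-local CPUs and
widens through the cpuset/possible masks before giving up, and the
function name here is made up for illustration.

    /*
     * Sketch only; the real select_fallback_rq() walks the node-local
     * mask first and falls back to cpuset/possible on failure.
     */
    static int select_fallback_rq_sketch(int cpu, struct task_struct *p)
    {
            int dest_cpu;

            /*
             * Callers hold ->pi_lock or rq->lock, so preemption is off.
             * Hence, if we observe cpu_active(dest_cpu) here, cpu_down()
             * cannot get past its synchronize_sched() until we re-enable
             * preemption, and dest_cpu cannot finish going down under us.
             */
            for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {
                    if (!cpu_online(dest_cpu))
                            continue;
                    /* The check Oleg proposes to remove: */
                    if (!cpu_active(dest_cpu))
                            continue;       /* CPU is on its way down */
                    return dest_cpu;
            }

            /* The real code widens the search here instead of failing. */
            return -1;
    }

Without the !cpu_active() continue, the loop above would happily hand
back CPU3 from the wakeup scenario described earlier.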