On 10/14, Peter Zijlstra wrote: > > On Mon, Oct 12, 2015 at 07:34:02PM +0200, Oleg Nesterov wrote: > > > > I simply can't understand... To me it looks as if we can simply remove > > the cpu_active() check in select_fallback_rq(). > > > > If we race with cpu_down(), cpu_active() is cleared by sched_cpu_inactive() > > which is CPU_PRI_SCHED_INACTIVE = INT_MIN + 1 priority, so it seems that > > only cpuset_cpu_inactive() can be called after that and this check is > > obviously racy anyway. > > Racy yes; however select_task_rq() is called with locks held, therefore > preemption disabled. This serializes us against the sync_sched() in > cpu_down().
Yes, I even mentioned this below. And this only means that if you see cpu_active() == T you know that smpboot_park_threads() can't be called until preempt_disable(). cpu_active() can become false right after the check, preempt_disable() can't protect from __cpu_notify(CPU_DOWN_PREPARE). > And note that smpboot_park_threads() runs until after the sync_sched(). > So you cannot add cpu_active() to select_task_rq() for it must allow > per-cpu tasks to remain on the cpu. Yes, yes, this is why this patch triggers BUG_ON() before ->park() in smpboot_thread_fn(). > > So why we can't simply remove select_fallback_rq()->cpu_active() ? > > It would allow migration onto a CPU that's known to be going away. That > doesn't make sense. But this is not that bad? This is very unlikely and CPU_DYING will migrate the woken task again. > > OK, I don't understand the usage of cpu_active_mask in kernel/sched/, > > and of course I could easily miss something else. But I doubt very > > much this check can save us from something bad simply because it is > > racy. > > But safely so; if we observe active, it must stay 'active' until we > enable preemption again. I don't undertand what "stay active" actually means here. cpu_active() is not stable. But probably this doesn't matter. > > Yes, we also have synchronize_sched() after CPU_DOWN_PREPARE, but > > the only thing we do before stop_machine() is smpboot_park_threads(). > > So this can help (say) set_cpus_allowed_ptr() which uses active_mask, > > but I don't see how this can connect to ttwu paths. > > Say you have a task A running on CPU4, it goes to sleep. We unplug CPU4. > We then commence unplugging CPU3, concurrently we wake A. A finds that > its old CPU (4) is no longer online and ends up in fallback looking for > a new one. > > Without the cpu_active() test in fallback, we could place A on CPU3, > which is on its way down, just not quite dead. The same can happen with the cpu_active() check we currently have. And note again that CPU_PRI_SCHED_INACTIVE == INT_MIN + 1. When sched_cpu_inactive() clears cpu_active() all other callbacks (except cpuset_cpu_inactive() were already fired. > Even if this is not strictly buggy its a daft thing to do and tempting > fate. See above... And if we remove this check from select_fallback_rq() we can revert dd9d3843755da95f6 "sched: Fix cpu_active_mask/cpu_online_mask race". And revert 875ebe94 "powerpc/smp: Wait until secondaries are active & online". And a1307bba "s390/smp: wait until secondaries are active & online". > > And again. If there is some reason we can not do this, say, because > > ipi to non-active CPU can trigger a warning, or something else, then > > we can hit the same problem because select_task_rq() does not check > > cpu_active(). The kernel threads like stoppers/workers are probably > > fine in any case, but userspace can trigger the race with cpu_up/down. > > So the only 'race' is observing active while we're stuck in > sync_sched(), Why? select_task_rq() will see cpu_online() == T after sync_sched(). > which is meant to be safe. Yes, because that damn cpu_active() check doesn't look strictly necessary ;) Or I misunderstood. > It also provides us with a > known spot after which we're sure no new tasks will come onto our dying > CPU. You mean that ttwu(task) can't migrate this task to the dying CPU after synchronize_rcu() and before stop_machine(), this is true. OK, lets keep this check if you dislike the idea to remove it. But to me the very fact that select_task_rq() and select_fallback_rq() use different checks looks just wrong. Even if everything happens to work. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/