Michael Ellerman <m...@ellerman.id.au> writes:
> Anton has a busy ppc64le KVM box where guests sometimes hit the infamous
> "kernel BUG at kernel/smpboot.c:134!" issue during boot:
>
>   BUG_ON(td->cpu != smp_processor_id());
>
> Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
> output confirms it:
>
>   CPU: 0
>   Comm: watchdog/130
>
> The problem is that we aren't ensuring the CPU active bit is set for the
> secondary before allowing the master to continue on. The master unparks
> the secondary CPU's kthreads and the scheduler looks for a CPU to run
> on. It calls select_task_rq() and realises the suggested CPU is not in
> the cpus_allowed mask. It then ends up in select_fallback_rq(), and
> since the active bit isn't set we choose some other CPU to run on.
>
> This seems to have been introduced by 6acbfb96976f "sched: Fix hotplug
> vs. set_cpus_allowed_ptr()", which changed from setting active before
> online to setting active after online. However that was in turn fixing a
> bug where other code assumed an active CPU was also online, so we can't
> just revert that fix.
>
> The simplest fix is just to spin waiting for both active & online to be
> set. We already have a barrier prior to set_cpu_online() (which also
> sets active), to ensure all other setup is completed before online &
> active are set.
>
> Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
> Signed-off-by: Michael Ellerman <m...@ellerman.id.au>
> Signed-off-by: Anton Blanchard <an...@samba.org>
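
For anyone following along, the "spin waiting for both active & online" idea quoted above can be sketched in userspace roughly as follows. This is only an illustrative model, not the actual patch: struct cpu_state, secondary_start(), wait_for_online_and_active() and bring_up_secondary() are hypothetical names, a pthread stands in for the secondary CPU, and C11 atomics stand in for the kernel's online/active masks and barriers.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU online/active bits. */
struct cpu_state {
    atomic_bool online;
    atomic_bool active;
};

/* Models the secondary CPU: online is published first, active later. */
static void *secondary_start(void *arg)
{
    struct cpu_state *cpu = arg;

    /* Release ordering: earlier setup is visible before the bit is seen. */
    atomic_store_explicit(&cpu->online, true, memory_order_release);
    sched_yield(); /* widen the window where online is set but active is not */
    atomic_store_explicit(&cpu->active, true, memory_order_release);
    return NULL;
}

/* Models the master: do not continue until BOTH bits are visible. */
static void wait_for_online_and_active(struct cpu_state *cpu)
{
    while (!atomic_load_explicit(&cpu->online, memory_order_acquire) ||
           !atomic_load_explicit(&cpu->active, memory_order_acquire))
        sched_yield(); /* userspace substitute for a busy-wait relax */
}

/* Returns true if both bits were set by the time the master continued. */
static bool bring_up_secondary(void)
{
    struct cpu_state cpu;
    pthread_t thread;
    bool ok;
    int rc;

    atomic_init(&cpu.online, false);
    atomic_init(&cpu.active, false);

    if (pthread_create(&thread, NULL, secondary_start, &cpu) != 0)
        return false;

    wait_for_online_and_active(&cpu);

    /* Only now would the master go on to unpark the secondary's kthreads. */
    ok = atomic_load(&cpu.online) && atomic_load(&cpu.active);

    rc = pthread_join(thread, NULL);
    assert(rc == 0);
    (void)rc;
    return ok;
}
```

Build with -pthread. Waiting on online alone would let the master continue inside the window between the two stores, which corresponds to the case where select_fallback_rq() picks some other CPU.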

By building a gcov-enabled skiboot, which makes OPAL_START_CPU a whole
bunch slower (because gcov), I could really *really* reliably reproduce
this. With this patch, I cannot.

Tested-by: Stewart Smith <stew...@linux.vnet.ibm.com>