Hi Michael, On Mon, Sep 24, 2018 at 05:00:42PM +1000, Michael Ellerman wrote: > Nathan Fontenot <nf...@linux.vnet.ibm.com> writes: > > On 09/18/2018 05:32 AM, Gautham R Shenoy wrote: > >> Hi Nathan, > >> On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot > >> <nf...@linux.vnet.ibm.com> wrote: > >>> > >>> When performing partition migrations all present CPUs must be online > >>> as all present CPUs must make the H_JOIN call as part of the migration > >>> process. Once all present CPUs make the H_JOIN call, one CPU is returned > >>> to make the rtas call to perform the migration to the destination system. > >>> > >>> During testing of migration and changing the SMT state we have found > >>> instances where CPUs are offlined, as part of the SMT state change, > >>> before they make the H_JOIN call. This results in a hung system where > >>> every CPU is either in H_JOIN or offline. > >>> > >>> To prevent this this patch disables CPU hotplug during the migration > >>> process. > >>> > >>> Signed-off-by: Nathan Fontenot <nf...@linux.vnet.ibm.com> > >>> --- > >>> arch/powerpc/kernel/rtas.c | 2 ++ > >>> 1 file changed, 2 insertions(+) > >>> > >>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c > >>> index 8afd146bc9c7..2c7ed31c736e 100644 > >>> --- a/arch/powerpc/kernel/rtas.c > >>> +++ b/arch/powerpc/kernel/rtas.c > >>> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle) > >>> goto out; > >>> } > >>> > >>> + cpu_hotplug_disable(); > >> > >> So, some of the onlined CPUs ( via > >> rtas_online_cpus_mask(offline_mask);) can go still offline, > >> if the userspace issues an offline command, just before we execute > >> cpu_hotplug_disable(). > >> > >> So we are narrowing down the race, but it still exists. Am I missing > >> something ? > > > > You're correct, this narrows the window in which a CPU can go offline. > > > > In testing with this patch we have not been able to re-create the failure > > but > > there is still a small window. > > Well let's close it. > > We just need to check that all present CPUs are online after we've > called cpu_hotplug_disable() don't we?
Yes. However, we cannot use the cpu_up() API to bring the offline CPUs online, since will return with an -EBUSY if CPU-Hotplug has been disabled. _cpu_up() works, but it is (understandably) a static function in kernel/cpu.c So, we might need a new APIs along the lines of disable_nonboot_cpus()/enable_nonboot_cpus() that is currently being used by the suspend subsystem, only that we would need the APIs to - Disable hotplug and online all the CPUs in an atomic fashion. Would be good if the API returns the cpumask of CPUs which were offline, which were brought online by this API. - Restore the state of the machine by offlining the CPUs which we brought online, and enable hotplug again. > > cheers >