On 09/18/2018 05:32 AM, Gautham R Shenoy wrote: > Hi Nathan, > On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot > <nf...@linux.vnet.ibm.com> wrote: >> >> When performing partition migrations all present CPUs must be online >> as all present CPUs must make the H_JOIN call as part of the migration >> process. Once all present CPUs make the H_JOIN call, one CPU is returned >> to make the rtas call to perform the migration to the destination system. >> >> During testing of migration and changing the SMT state we have found >> instances where CPUs are offlined, as part of the SMT state change, >> before they make the H_JOIN call. This results in a hung system where >> every CPU is either in H_JOIN or offline. >> >> To prevent this this patch disables CPU hotplug during the migration >> process. >> >> Signed-off-by: Nathan Fontenot <nf...@linux.vnet.ibm.com> >> --- >> arch/powerpc/kernel/rtas.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c >> index 8afd146bc9c7..2c7ed31c736e 100644 >> --- a/arch/powerpc/kernel/rtas.c >> +++ b/arch/powerpc/kernel/rtas.c >> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle) >> goto out; >> } >> >> + cpu_hotplug_disable(); > > So, some of the onlined CPUs ( via > rtas_online_cpus_mask(offline_mask);) can go still offline, > if the userspace issues an offline command, just before we execute > cpu_hotplug_disable(). > > So we are narrowing down the race, but it still exists. Am I missing > something ?
You're correct, this narrows the window in which a CPU can go offline. In testing with this patch we have not been able to re-create the failure but there is still a small window. -Nathan > >> stop_topology_update(); >> >> /* Call function on all CPUs. One of us will make the >> @@ -995,6 +996,7 @@ int rtas_ibm_suspend_me(u64 handle) >> printk(KERN_ERR "Error doing global join\n"); >> >> start_topology_update(); >> + cpu_hotplug_enable(); >> >> /* Take down CPUs not online prior to suspend */ >> cpuret = rtas_offline_cpus_mask(offline_mask); >> > >