During DLPAR operations, newly added CPUs start in halted mode. The
kernel then takes some time to initialize those CPUs internally and
starts them using the "start-cpu" RTAS call. However, if a kexec crash
occurs within this window (before the new CPU has been initialized),
the kexec NMI will try to reset all other CPUs from the crashing CPU,
which leads to the firmware starting the uninitialized CPUs as well.
This causes the kdump kernel to hang during bringup.
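
The window described above can be pictured with a small user-space
model. This is only a sketch, not kernel code: struct model_cpu and both
helper names are hypothetical, and the model indexes its state by
logical CPU number. It contrasts a broadcast reset, which reaches every
present CPU, with a per-CPU reset that skips CPUs not yet started.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state for this model; not the kernel's paca layout. */
struct model_cpu {
	bool present;	/* CPU has been added to the partition (e.g. via DLPAR) */
	bool started;	/* kernel has finished initializing and starting it */
};

/*
 * Model of a broadcast "all others" reset: every present CPU except the
 * caller is reached. Returns how many of the reached CPUs were still
 * halted, i.e. present but not yet started.
 */
size_t halted_cpus_hit_by_broadcast(const struct model_cpu *cpus, size_t n,
				    size_t self)
{
	size_t halted = 0;

	assert(self < n);
	for (size_t k = 0; k < n; k++) {
		if (k == self || !cpus[k].present)
			continue;
		if (!cpus[k].started)
			halted++;
	}
	return halted;
}

/*
 * Model of a targeted reset: only CPUs that are present and started are
 * signalled. hit[k] records whether CPU k was signalled. Returns the
 * number of CPUs signalled.
 */
size_t reset_started_others(const struct model_cpu *cpus, size_t n,
			    size_t self, bool *hit)
{
	size_t signalled = 0;

	assert(self < n);
	for (size_t k = 0; k < n; k++) {
		hit[k] = (k != self) && cpus[k].present && cpus[k].started;
		if (hit[k])
			signalled++;
	}
	return signalled;
}
```

With a CPU that is present but not started, the first helper reports a
non-zero count, which corresponds to the hang scenario, while the second
helper never marks that CPU as signalled.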

Sample Log:
[175993.028231][ T1502] NIP [00007fffb953f394] 0x7fffb953f394
[175993.028314][ T1502] LR [00007fffb953f394] 0x7fffb953f394
[175993.028390][ T1502] --- interrupt: 3000
[ 5.519483][ T1] Processor 0 is stuck.
[ 11.089481][ T1] Processor 1 is stuck.

To fix this, only issue the system-reset hcall to CPUs that have
actually been started by the kernel.

Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
Cc: Nysal Jan K.A. <[email protected]>
Cc: Vishal Chourasia <[email protected]>
Cc: Ritesh Harjani <[email protected]>
Cc: Sourabh Jain <[email protected]>
Signed-off-by: Shivang Upadhyay <[email protected]>
---
 arch/powerpc/platforms/pseries/smp.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index db99725e752b..e5518cf71094 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -173,10 +173,24 @@ static void dbell_or_ic_cause_ipi(int cpu)
 
 static int pseries_cause_nmi_ipi(int cpu)
 {
-	int hwcpu;
+	int hwcpu, k;
 
 	if (cpu == NMI_IPI_ALL_OTHERS) {
-		hwcpu = H_SIGNAL_SYS_RESET_ALL_OTHERS;
+
+		for_each_present_cpu(k) {
+			if (k != smp_processor_id()) {
+				hwcpu = get_hard_smp_processor_id(k);
+
+				/* it is possible that cpu is present,
+				 * but not started yet
+				 */
+				if (paca_ptrs[hwcpu]->cpu_start == 1)
+					plpar_signal_sys_reset(hwcpu);
+			}
+		}
+
+		return 1;
+
 	} else {
 		if (cpu < 0) {
 			WARN_ONCE(true, "incorrect cpu parameter %d", cpu);
-- 
2.52.0
