Le 06/12/2018 à 12:31, Gautham R. Shenoy a écrit : > From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com> > > Currently running DLPAR offline/online operations in a loop on a > POWER9 system with SMT=off results in the following crash: > > [ 223.321032] cpu 112 (hwid 112) Ready to die... > [ 223.355963] Querying DEAD? cpu 113 (113) shows 2 > [ 223.356233] cpu 114 (hwid 114) Ready to die... > [ 223.356236] cpu 113 (hwid 113) Ready to die... > [ 223.356241] Bad kernel stack pointer 1faf6ca0 at 1faf6d50 > [ 223.356243] Oops: Bad kernel stack pointer, sig: 6 [#1] > [ 223.356244] LE SMP NR_CPUS=2048 NUMA pSeries > [ 223.356247] Modules linked in: > [ 223.356255] CPU: 114 PID: 0 Comm: swapper/114 Not tainted 4.20.0-rc3 #39 > [ 223.356258] NIP: 000000001faf6d50 LR: 000000001faf6d50 CTR: > 000000001ec6d06c > [ 223.356259] REGS: c00000001e5cbd30 TRAP: 0700 Not tainted (4.20.0-rc3) > [ 223.356260] MSR: 8000000000081000 <SF,ME> CR: 28000004 XER: 00000020 > [ 223.356263] CFAR: 000000001ec98590 IRQMASK: 8000000000009033 > GPR00: 000000001faf6d50 000000001faf6ca0 000000001ed1c448 > 00000267e6a273c3 > GPR04: 0000000000000000 00000000000000e0 000000000000dfe8 > 000000001faf6d30 > GPR08: 000000001faf6d28 00000267e6a273c3 000000001ec1b108 > 0000000000000000 > GPR12: 0000000001b6d998 c00000001eb55080 c0000003a1b8bf90 > 000000001eea3f40 > GPR16: 0000000000000000 c0000006fda85100 c00000000004c8b0 > c0000000013d5300 > GPR20: c0000006fda85300 0000000000000008 c0000000019d2cf8 > c0000000013d6888 > GPR24: 0000000000000072 c0000000013d688c 0000000000000002 > c0000000013d688c > GPR28: c0000000019cecf0 0000000000000390 0000000000000000 > 000000001faf6ca0 > [ 223.356281] NIP [000000001faf6d50] 0x1faf6d50 > [ 223.356281] LR [000000001faf6d50] 0x1faf6d50 > [ 223.356282] Call Trace: > [ 223.356283] Instruction dump: > [ 223.356285] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX > XXXXXXXX > [ 223.356286] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX > XXXXXXXX > [ 223.356290] ---[ end trace f46a4e046b564d1f ]--- > > This is due to multiple offlined CPUs (CPU 113 and CPU 114 above) > concurrently (within 3us) trying to make the rtas-call with the > "stop-self" token, something that is prohibited by the PAPR. > > The concurrent calls can happen as follows. > > o In dlpar_offline_cpu() we prod an offline CPU X (offline due to > SMT=off) and loop for 25 tries in pseries_cpu_die() querying if > the target CPU X has been stopped in RTAS. After 25 tries, we > prints the message "Querying DEAD? cpu X (X) shows 2" and return > to dlpar_offline_cpu(). Note that at this point CPU X has not yet > called rtas with the "stop-self" token, but can do so anytime now. > > o Back in dlpar_offline_cpu(), we prod the next offline CPU Y. CPU Y > promptly wakes up and calls RTAS with the "stop-self" token. > > o Before RTAS can stop CPU Y, CPU X also calls RTAS with the > "stop-self" token. > > The problem occurs due to the short number of tries (25) in > pseries_cpu_die() which covers 90% of the cases. For the remaining 10% > of the cases, it was observed that we would need to loop for a few > hundred iterations before the target CPU calls rtas with "stop-self" > token.Experiments show that each try takes roughly ~25us. > > In this patch we fix the problem by increasing the number of tries > from 25 to 4000 (which roughly corresponds to 100ms) before bailing > out and declaring that we have failed to observe the target CPU call > rtas-stop-self. This provides sufficient serialization to ensure that > there no concurrent rtas calls with "stop-self" token. > > Reported-by: Michael Bringmann <m...@linux.vnet.ibm.com> > Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
This patch badly conflicts with 801980f64979 ("powerpc/pseries/hotplug-cpu: wait indefinitely for vCPU death") Is it still applicable ? If so, please rebase and resubmit. Thanks Christophe > --- > arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 ++++++++++++- > 1 file changed, 12 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c > b/arch/powerpc/platforms/pseries/hotplug-cpu.c > index 2f8e621..c913c44 100644 > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c > @@ -214,14 +214,25 @@ static void pseries_cpu_die(unsigned int cpu) > msleep(1); > } > } else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) { > + int max_tries = 4000; /* Roughly corresponds to 100ms */ > + u64 tb_before = mftb(); > > - for (tries = 0; tries < 25; tries++) { > + for (tries = 0; tries < max_tries; tries++) { > cpu_status = smp_query_cpu_stopped(pcpu); > if (cpu_status == QCSS_STOPPED || > cpu_status == QCSS_HARDWARE_ERROR) > break; > cpu_relax(); > } > + > + if (tries == max_tries) { > + u64 time_elapsed_us = div_u64(mftb() - tb_before, > + tb_ticks_per_usec); > + > + pr_warn("Offlined CPU %d isn't stopped by RTAS after > %llu us\n", > + cpu, time_elapsed_us); > + WARN_ON(1); > + } > } > > if (cpu_status != 0) {