Hi, I'm getting the following kernel panics on recent 11-STABLE (r317883): spin lock 0xffffffff81df43d0 (smp rendezvous) held by 0xfffff8019c7a7000 (tid 100845) too long timeout stopping cpus panic: spin lock held too long cpuid = 12 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe3e64caf2a0 vpanic() at vpanic+0x186/frame 0xfffffe3e64caf320 panic() at panic+0x43/frame 0xfffffe3e64caf380 _mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0x311/frame 0xfffffe3e64caf3f0 smp_targeted_tlb_shootdown() at smp_targeted_tlb_shootdown+0x101/frame 0xfffffe3e64caf470 smp_masked_invlpg_range() at smp_masked_invlpg_range+0x50/frame 0xfffffe3e64caf4a0 pmap_invalidate_range() at pmap_invalidate_range+0x196/frame 0xfffffe3e64caf4e0 vm_thread_stack_dispose() at vm_thread_stack_dispose+0x2f/frame 0xfffffe3e64caf530 thread_free() at thread_free+0x39/frame 0xfffffe3e64caf550 thread_reap() at thread_reap+0x10e/frame 0xfffffe3e64caf570 proc_reap() at proc_reap+0x6bd/frame 0xfffffe3e64caf5b0 proc_to_reap() at proc_to_reap+0x48c/frame 0xfffffe3e64caf600 kern_wait6() at kern_wait6+0x49e/frame 0xfffffe3e64caf6b0 sys_wait4() at sys_wait4+0x78/frame 0xfffffe3e64caf8a0 amd64_syscall() at amd64_syscall+0x6c4/frame 0xfffffe3e64cafa30 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe3e64cafa30
The panics occur sporadically and seemingly randomly. They occur on multiple machines with the same configuration, and on older revisions of 11-STABLE as well as r317883 (I cannot easily bisect this down to a specific commit given that it can take days or sometimes weeks for the panic to occur). Note that my kernel is compiled with "options IPSEC", but that does not seem to be relevant. My understanding of the overall problem is that an IPI is sent to all cores by smp_rendezvous_cpus, but that some of the cores do not respond to it (or at least do not respond to it correctly and in time). More specifically, I can find 61 threads that seem to be blocked in cpustop_handler on an atomic operation to indicate that they have stopped in response to the IPI. This operation is CPU_SET_ATOMIC(cpu, &stopped_cpus); CPU_SET_ATOMIC boils down to a call to "atomic_set_long" (not sure where that is itself defined for amd64). Either there is a line numbering problem, or this suggests that perhaps there is a deadlock at this step (unless there's some sort of livelock and CPU_SET_ATOMIC is the place where the threads spend most of their time). Looking at the thread backtraces as a whole, I can see the one that triggered the panic, the 61 that are in cpustop_handler, and a lot of sleeping threads. I guess each core should have an active thread, and this is an architecture with 64 cores, so that leaves 64 - (1 + 61) = 2 cores that are unaccounted for. Interestingly, for 2 different panics for which I have a vmcore file these numbers are the same. For these two panics, the IDs of the cores that are neither in cpustop_handler nor calling the smp rendez vous are not the same (#18 and #19 in one instance, #44 and #45 in the other instance) but in both instances the IDs are consecutive; perhaps that's relevant. I've uploaded a full set of backtraces at https://gist.github.com/cinquin/d63cdf9de01b0b8033c47128868f2d38 and a dmesg at https://gist.github.com/cinquin/17c0cf6ac7832fd1b683488d08d3e69b (in short, the machine is a 4 processor Opteron 6274 system). I'm pasting below an example backtrace of the threads in stopcpu_handler. Any suggestions as to what the problem might be and how to solve it? I've saved the core and can provide further information. Thanks Olivier Cinquin A total of 61 threads are along the lines of #0 cpustop_handler () at /usr/src/sys/x86/x86/mp_x86.c:1275 #1 0xffffffff8105a6f5 in ipi_nmi_handler () at /usr/src/sys/x86/x86/mp_x86.c:1231 #2 0xffffffff80ef740a in trap (frame=0xfffffe3e687f7f30) at /usr/src/sys/amd64/amd64/trap.c:185 #3 0xffffffff80edbd03 in nmi_calltrap () at /usr/src/sys/amd64/amd64/exception.S:510 ... (kgdb) frame 0 #0 cpustop_handler () at /usr/src/sys/x86/x86/mp_x86.c:1275 1275 CPU_SET_ATOMIC(cpu, &stopped_cpus); (kgdb) list 1270 cpu = PCPU_GET(cpuid); 1271 1272 savectx(&stoppcbs[cpu]); 1273 1274 /* Indicate that we are stopped */ 1275 CPU_SET_ATOMIC(cpu, &stopped_cpus); 1276 1277 /* Wait for restart */ 1278 while (!CPU_ISSET(cpu, &started_cpus)) 1279 ia32_pause(); (kgdb) p stopped_cpus $1 = {__bits = 0xffffffff81df4368} (kgdb) p started_cpus $2 = {__bits = 0xffffffff81df4418} _______________________________________________ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"