Hello,
still trying to debug panics/hangs on a heavily loaded web server, I got a
hard hang, and in the stack traces on the various CPUs I spotted this:

CPU 0 (the one where I could enter ddb from the console):
Stopped in pid 0.3 (system) at netbsd:breakpoint+0x5: leave
breakpoint() at netbsd:breakpoint+0x5
comintr() at netbsd:comintr+0x7e0
intr_wrapper() at netbsd:intr_wrapper+0x4b
Xhandle_ioapic_edge2() at netbsd:Xhandle_ioapic_edge2+0x6f
--- interrupt ---
mutex_vector_enter() at netbsd:mutex_vector_enter+0x3f0
pool_get() at netbsd:pool_get+0x69
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x139
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x233
m_get() at netbsd:m_get+0x37
m_copy_internal() at netbsd:m_copy_internal+0x13e
tcp4_segment() at netbsd:tcp4_segment+0x1f9
ip_tso_output() at netbsd:ip_tso_output+0x24
ip_output() at netbsd:ip_output+0x18c4
tcp_output() at netbsd:tcp_output+0x165e
tcp_input() at netbsd:tcp_input+0xfd5
ipintr() at netbsd:ipintr+0x8f1
softint_dispatch() at netbsd:softint_dispatch+0x11c
And:

db{0}> mach cpu 2
using CPU 2
db{0}> tr
_kernel_lock() at netbsd:_kernel_lock+0xd5
mb_drain() at netbsd:mb_drain+0x17
pool_grow() at netbsd:pool_grow+0x3b9
pool_get() at netbsd:pool_get+0x3c7
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x139
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x233
m_get() at netbsd:m_get+0x37
m_gethdr() at netbsd:m_gethdr+0x9
sosend() at netbsd:sosend+0x3d4
soo_write() at netbsd:soo_write+0x2f
dofilewrite() at netbsd:dofilewrite+0x80
sys_write() at netbsd:sys_write+0x49
syscall() at netbsd:syscall+0x196

I wonder if we can have a deadlock here: CPU 2 holds the mbuf pool's lock
and tries to get the kernel_lock. It looks like the softint thread on CPU 0
holds the kernel_lock (as it's not running with NET_MPSAFE) and tries to get
the mbuf pool's lock. In this specific case, CPU 0 won't sleep (which would
release the kernel_lock) because the mbuf pool's lock owner is spinning on
CPU 2.

Other CPUs are also trying to get the kernel_lock or the mbuf pool's lock.
Several are in:

mutex_vector_enter() at netbsd:mutex_vector_enter+0x209
tcp_timer_rexmt() at netbsd:tcp_timer_rexmt+0x28
callout_softclock() at netbsd:callout_softclock+0xd2
softint_dispatch() at netbsd:softint_dispatch+0x11c

One is doing:

db{0}> tr
_kernel_lock() at netbsd:_kernel_lock+0xd5
mb_drain() at netbsd:mb_drain+0x17
pool_grow() at netbsd:pool_grow+0x3b9
pool_get() at netbsd:pool_get+0x3c7
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x139
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x233
m_clget() at netbsd:m_clget+0x2b
sosend() at netbsd:sosend+0x489
soo_write() at netbsd:soo_write+0x2f
dofilewrite() at netbsd:dofilewrite+0x80
sys_write() at netbsd:sys_write+0x49
syscall() at netbsd:syscall+0x196

but at first glance it's not part of the deadlock; it's the only stack trace
related to mbuf clusters.

Is this analysis correct?

--
Manuel Bouyer <bou...@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference
--