Hello,
I'm still trying to debug panics/hangs on a heavily loaded web server.

I got a hard hang, and in the stack traces on the various CPUs I spotted this:
CPU 0 (the one where I could enter ddb from console):
Stopped in pid 0.3 (system) at  netbsd:breakpoint+0x5:  leave
breakpoint() at netbsd:breakpoint+0x5
comintr() at netbsd:comintr+0x7e0
intr_wrapper() at netbsd:intr_wrapper+0x4b
Xhandle_ioapic_edge2() at netbsd:Xhandle_ioapic_edge2+0x6f
--- interrupt ---
mutex_vector_enter() at netbsd:mutex_vector_enter+0x3f0
pool_get() at netbsd:pool_get+0x69
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x139
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x233
m_get() at netbsd:m_get+0x37
m_copy_internal() at netbsd:m_copy_internal+0x13e
tcp4_segment() at netbsd:tcp4_segment+0x1f9
ip_tso_output() at netbsd:ip_tso_output+0x24
ip_output() at netbsd:ip_output+0x18c4
tcp_output() at netbsd:tcp_output+0x165e
tcp_input() at netbsd:tcp_input+0xfd5
ipintr() at netbsd:ipintr+0x8f1
softint_dispatch() at netbsd:softint_dispatch+0x11c

And:
db{0}> mach cpu 2
using CPU 2
db{0}> tr
_kernel_lock() at netbsd:_kernel_lock+0xd5
mb_drain() at netbsd:mb_drain+0x17    
pool_grow() at netbsd:pool_grow+0x3b9 
pool_get() at netbsd:pool_get+0x3c7   
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x139
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x233
m_get() at netbsd:m_get+0x37
m_gethdr() at netbsd:m_gethdr+0x9     
sosend() at netbsd:sosend+0x3d4       
soo_write() at netbsd:soo_write+0x2f  
dofilewrite() at netbsd:dofilewrite+0x80
sys_write() at netbsd:sys_write+0x49  
syscall() at netbsd:syscall+0x196     


I wonder if we can have a deadlock here: CPU 2 holds the mbuf pool's lock and
tries to take the kernel_lock. Meanwhile, the softint thread on CPU 0 holds
the kernel_lock (as it's not running with NET_MPSAFE) and tries to take the
mbuf pool's lock.

In this specific case, CPU 0 won't sleep (which would release the kernel_lock)
because the mbuf pool's lock owner is spinning on CPU 2.

Other CPUs are also trying to get the kernel_lock or the mbuf pool's lock.
Several are in:
mutex_vector_enter() at netbsd:mutex_vector_enter+0x209
tcp_timer_rexmt() at netbsd:tcp_timer_rexmt+0x28
callout_softclock() at netbsd:callout_softclock+0xd2
softint_dispatch() at netbsd:softint_dispatch+0x11c

One is doing:
db{0}> tr
_kernel_lock() at netbsd:_kernel_lock+0xd5
mb_drain() at netbsd:mb_drain+0x17
pool_grow() at netbsd:pool_grow+0x3b9
pool_get() at netbsd:pool_get+0x3c7
pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x139
pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x233
m_clget() at netbsd:m_clget+0x2b
sosend() at netbsd:sosend+0x489
soo_write() at netbsd:soo_write+0x2f
dofilewrite() at netbsd:dofilewrite+0x80
sys_write() at netbsd:sys_write+0x49
syscall() at netbsd:syscall+0x196

but at first glance it's not part of the deadlock; it's just the only stack
trace related to mbuf clusters.

Is this analysis correct?

-- 
Manuel Bouyer <bou...@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference