On Tue, May 13, 2025 at 12:25:47AM +0000, Taylor R Campbell wrote:
> > Date: Mon, 5 May 2025 18:08:19 +0200
> > From: Manuel Bouyer <bou...@antioche.eu.org>
> >
> > still trying to debug panics/hangs on a heavily loaded web server
>
> What kernel version?

NetBSD 10.1_STABLE, sorry. I opened kern/59411 about it.

> > I got a hard hang;
>
> What does `hard hang' mean?  Is there a heartbeat panic?  Can

No heartbeat here (it's only in HEAD, right?). All activity stops
(network and serial console), but I can enter ddb.

> you share the full output of ps, ps/w, and show all tstiles?  And can
> you show the stack traces for all CPUs with `mach cpu N'?

I'll try to catch this next time. But there's no process in tstile state.

> > db{0}> mach cpu 2
> > using CPU 2
> > db{0}> tr
> > _kernel_lock() at netbsd:_kernel_lock+0xd5
> > mb_drain() at netbsd:mb_drain+0x17
> > pool_grow() at netbsd:pool_grow+0x3b9
> > pool_get() at netbsd:pool_get+0x3c7
> > [...]
> >
> > I wonder if we can have a deadlock here: CPU 2 holds mbuf pool's lock and
> > tries to get _kernel_lock(). It looks like the softint thread on CPU 0
> > holds the kernel_lock (as it's not running with NET_MPSAFE) and tries
> > to get the mbuf pool's lock.
>
> This deadlock doesn't make sense because we drop the pool lock around
> the drain hook (mb_drain):
>
>    1129 			/*
>    1130 			 * Since the drain hook is going to free things
>    1131 			 * back to the pool, unlock, call the hook, re-lock,
>    1132 			 * and check the hardlimit condition again.
>    1133 			 */
>    1134 			mutex_exit(&pp->pr_lock);
>    1135 			(*pp->pr_drain_hook)(pp->pr_drain_hook_arg, flags);
>    1136 			mutex_enter(&pp->pr_lock);
>    1137 			if (pp->pr_nout < pp->pr_hardlimit)
>    1138 				goto startover;
>
> https://nxr.netbsd.org/xref/src/sys/kern/subr_pool.c?r=1.293#1129

That's true for pool_get(), but not for pool_allocator_alloc().

-- 
Manuel Bouyer <bou...@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--