Re: deadlock between KERNEL_LOCK and a mutex ?

Manuel Bouyer Tue, 13 May 2025 00:17:48 -0700

On Tue, May 13, 2025 at 12:25:47AM +0000, Taylor R Campbell wrote:
> > Date: Mon, 5 May 2025 18:08:19 +0200
> > From: Manuel Bouyer <bou...@antioche.eu.org>
> > 
> > still trying to debug panics/hangs on a heavily loaded web server
> 
> What kernel version?


NetBSD 10.1_STABLE, sorry.
I opened kern/59411 about it.

> 
> > I got a hard hang;
> 
> What does `hard hang' mean?  Is there a heartbeat panic?  Can

No heartbeat here (it's only in HEAD, right?). All activity
stops (network, serial console), but I can enter ddb.

> you share the full output of ps, ps/w, and show all tstiles?  And can
> you show the stack traces for all CPUs with `mach cpu N'?

I'll try to catch this next time. But there's no process in tstile
state.

> 
> > db{0}> mach cpu 2
> > using CPU 2
> > db{0}> tr
> > _kernel_lock() at netbsd:_kernel_lock+0xd5
> > mb_drain() at netbsd:mb_drain+0x17    
> > pool_grow() at netbsd:pool_grow+0x3b9 
> > pool_get() at netbsd:pool_get+0x3c7   
> > [...]
> > 
> > I wonder if we can have a deadlock here: CPU 2 holds mbuf pool's lock and
> > tries to get _kernel_lock(). It looks like the softint thread on CPU 0
> > holds the kernel_lock (as it's not running with NET_MPSAFE) and tries
> > to get the mbuf pool's lock.
> 
> This deadlock doesn't make sense because we drop the pool lock around
> the drain hook (mb_drain):
> 
>    1129                       /*
>    1130                        * Since the drain hook is going to free things
>    1131                        * back to the pool, unlock, call the hook, re-lock,
>    1132                        * and check the hardlimit condition again.
>    1133                        */
>    1134                       mutex_exit(&pp->pr_lock);
>    1135                       (*pp->pr_drain_hook)(pp->pr_drain_hook_arg, flags);
>    1136                       mutex_enter(&pp->pr_lock);
>    1137                       if (pp->pr_nout < pp->pr_hardlimit)
>    1138                               goto startover;
> 
> https://nxr.netbsd.org/xref/src/sys/kern/subr_pool.c?r=1.293#1129


That's true for pool_get(), but not for pool_allocator_alloc().
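
To make the distinction concrete, here is a minimal user-space sketch
(pthreads, not kernel code) of the two orderings. All names in it
(toy_pool, toy_drain, big_lock, toy_get_drop_lock, toy_get_hold_lock)
are hypothetical stand-ins and do not appear in subr_pool.c. The first
variant drops the pool lock around the drain hook, as the quoted
pool_get() code does; the second calls the hook with the pool lock
still held, which creates a pool lock -> big lock order. Whether
pool_allocator_alloc() really behaves like the second variant is the
open question here; the sketch only shows why it would matter.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pool; not the NetBSD struct pool. */
struct toy_pool {
	pthread_mutex_t lock;        /* plays the role of pr_lock */
	void (*drain_hook)(void *);  /* plays the role of pr_drain_hook */
	void *drain_arg;
	unsigned nout;               /* items currently handed out */
	unsigned hardlimit;
};

/* Plays the role of the kernel lock in the trace. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Set by the hook: was the pool lock held when the hook ran? */
static bool hook_saw_pool_locked;

/* Drain hook: takes the big lock, then tries to give one item back. */
static void
toy_drain(void *arg)
{
	struct toy_pool *pp = arg;

	assert(pp != NULL);
	pthread_mutex_lock(&big_lock);
	/* trylock fails if the pool lock is held, even by this thread. */
	if (pthread_mutex_trylock(&pp->lock) == 0) {
		hook_saw_pool_locked = false;
		if (pp->nout > 0)
			pp->nout--;
		pthread_mutex_unlock(&pp->lock);
	} else {
		hook_saw_pool_locked = true;
	}
	pthread_mutex_unlock(&big_lock);
}

/* Variant 1: unlock, call the hook, re-lock, re-check the limit. */
static bool
toy_get_drop_lock(struct toy_pool *pp)
{
	bool ok;

	pthread_mutex_lock(&pp->lock);
	if (pp->nout >= pp->hardlimit && pp->drain_hook != NULL) {
		pthread_mutex_unlock(&pp->lock);
		pp->drain_hook(pp->drain_arg);
		pthread_mutex_lock(&pp->lock);
	}
	ok = pp->nout < pp->hardlimit;
	if (ok)
		pp->nout++;
	pthread_mutex_unlock(&pp->lock);
	return ok;
}

/*
 * Variant 2: the hook runs with the pool lock held.  The hook then
 * takes big_lock, giving the order pool lock -> big lock.  A second
 * thread that holds big_lock and wants the pool lock would deadlock
 * against this one.
 */
static bool
toy_get_hold_lock(struct toy_pool *pp)
{
	bool ok;

	pthread_mutex_lock(&pp->lock);
	if (pp->nout >= pp->hardlimit && pp->drain_hook != NULL)
		pp->drain_hook(pp->drain_arg);
	ok = pp->nout < pp->hardlimit;
	if (ok)
		pp->nout++;
	pthread_mutex_unlock(&pp->lock);
	return ok;
}
```

In the second variant the hook cannot free anything back to the pool
either, since it cannot take the pool lock its caller already holds.
The sketch uses trylock only so that it reports the problem instead
of hanging.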

-- 
Manuel Bouyer <bou...@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference