On 21/02/24(Wed) 13:05, Claudio Jeker wrote:
> On Tue, Feb 20, 2024 at 09:34:12PM +0100, Martin Pieuchot wrote:
> > On 28/10/21(Thu) 05:45, Visa Hankala wrote:
> > > On Wed, Oct 27, 2021 at 09:02:08PM -0400, Dave Voutila wrote:
> > > > Dave Voutila <d...@sisu.io> writes:
> > > >
> > > > > Was tinkering on a bt(5) script for trying to debug an issue in vmm(4)
> > > > > when I managed to start hitting a panic "wakeup: p_stat is 2" being
> > > > > triggered by kqueue coming from the softnet kernel task.
> > > > >
> > > > > I'm running an amd64 kernel built from the tree today (latest CVS
> > > > > commit id UynQo1r7kLKA0Q2p) with VMM_DEBUG option set and the defaults
> > > > > from GENERIC.MP. Userland is from the latest amd snap.
> > > > >
> > > > > To reproduce, I'm running a single OpenBSD-current guest under vmd(8)
> > > > > which I'm targeting with the following trivial btrace script I was
> > > > > working on to use for debugging something in vmm(4):
> > > > >
> > > > > tracepoint:sched:sleep / pid == $1 && tid == $2 /{
> > > > > 	printf("pid %d, tid %d slept %d!\n", pid, tid, nsecs);
> > > > > }
> > > > >
> > > > > tracepoint:sched:wakeup / pid == $1 && tid == $2 /{
> > > > > 	printf("pid %d, tid %d awoke %d!\n", pid, tid, nsecs);
> > > > > }
> > > >
> > > > Even easier reproduction: if you have 2 machines and can use tcpbench(1)
> > > > between them, then while tcpbench is running target it with the above
> > > > btrace script. I've found running the script, killing it with ctrl-c,
> > > > and re-running it 2-3 times triggers the panic on my laptop.
> > > > >
> > > > > Both times this happened I was trying to sysupgrade the vmd(8) guest
> > > > > while running the above btrace script. When I don't run the script,
> > > > > there is no panic.
> > > > >
> > > > > Image of the full backtrace is here: https://imgur.com/a/swW1qoj
> > > > >
> > > > > Simple transcript of the call stack after the panic() call looks like:
> > > > >
> > > > > wakeup_n
> > > > > kqueue_wakeup
> > > > > knote
> > > > > selwakeup
> > > > > tun_enqueue
> > > > > ether_output
> > > > > ip_output
> > > > > ip_forward
> > > > > ip_input_if
> > > > > ipv4_input
> > > > > ether_input
> > > > > if_input_process
> > > > >
> > > > > The other 3 cpu cores appeared to be in ipi handlers. (Image in that
> > > > > imgur link)

> > > I think the problem is recursion within the scheduler. Trace points
> > > invoke wakeup() directly, which is unsafe when done from within the
> > > run queue routines.
> > >
> > > One approach to avoid the recursion is to defer the wakeup() with
> > > a soft interrupt. The scheduler runs at an elevated system priority
> > > level that blocks soft interrupts. The deferred wakeup() is issued once
> > > the system priority level turns low enough.
> > >
> > > Unfortunately, the patch will get broken when someone adds trace points
> > > to soft interrupt scheduling...

> > Thanks for the report. Sorry for the delay. I'm now really interested
> > in fixing this bug because I'm heavily using btrace(8) to analyse the
> > scheduler and I hit this panic a couple of times.
> >
> > The problem is that `pnext' might no longer be on the sleepqueue after a
> > tracepoint inside setrunnable() fired. Diff below fixes that by making
> > wakeup_n() re-entrant.
> >
> > I'm not very interested in committing this diff because it relies on a
> > recursive SCHED_LOCK(). Instead I'd prefer to split wakeup_n() in two
> > stages: first unlink the threads, then call setrunnable(). This approach
> > will help us untangle the sleepqueue but needs a bit more shuffling,
> > like moving unsleep() out of setrunnable()...
> >
> > Claudio, Visa do you agree?
> > [...]
>
> Ugh, this is too much magic for my little brain.
> I would prefer to unqueue the procs onto a new local list and have a
> special wakeup_proc for them. The code around wakeup and unsleep will need
> some changes to unlock it properly.

Yes, that is what I had in mind. Diff below does that and works for
me(tm). The first loop removes threads from the sleep queue and the
second puts them on a runqueue and optionally calls tracepoints.

Note that unsleep() has been removed from setrunnable() to reduce the
complexity of the code. Messing with the state of the sleepqueue only
in some cases down in that path doesn't help me get a simpler picture
of this machinery.

ok?

Index: dev/pci/drm/drm_linux.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/drm/drm_linux.c,v
retrieving revision 1.109
diff -u -p -r1.109 drm_linux.c
--- dev/pci/drm/drm_linux.c	21 Jan 2024 13:36:40 -0000	1.109
+++ dev/pci/drm/drm_linux.c	22 Feb 2024 14:32:27 -0000
@@ -162,7 +162,7 @@ wake_up_process(struct proc *p)
 	int s, rv;
 
 	SCHED_LOCK(s);
-	rv = wakeup_proc(p, NULL, 0);
+	rv = wakeup_proc(p, 0);
 	SCHED_UNLOCK(s);
 	return rv;
 }
Index: kern/kern_sig.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.321
diff -u -p -r1.321 kern_sig.c
--- kern/kern_sig.c	17 Jan 2024 22:22:25 -0000	1.321
+++ kern/kern_sig.c	22 Feb 2024 14:25:16 -0000
@@ -1200,6 +1200,7 @@ runfast:
 	if (p->p_usrpri > PUSER)
 		p->p_usrpri = PUSER;
 run:
+	unsleep(p);
 	setrunnable(p);
 out:
 	/* finally adjust siglist */
@@ -2104,6 +2105,7 @@ single_thread_set(struct proc *p, int fl
 		if (q->p_flag & P_WEXIT) {
 			if (mode == SINGLE_EXIT) {
 				if (q->p_stat == SSTOP) {
+					unsleep(q);
 					setrunnable(q);
 					atomic_inc_int(&pr->ps_singlecount);
 				}
@@ -2125,12 +2127,14 @@ single_thread_set(struct proc *p, int fl
 					break;
 				}
 				/* need to unwind or exit, so wake it */
+				unsleep(q);
 				setrunnable(q);
 			}
 			atomic_inc_int(&pr->ps_singlecount);
 			break;
 		case SSTOP:
 			if (mode == SINGLE_EXIT) {
+				unsleep(q);
 				setrunnable(q);
 				atomic_inc_int(&pr->ps_singlecount);
 			}
@@ -2198,9 +2202,10 @@ single_thread_clear(struct proc *p, int 
 		 * it back into some sleep queue
 		 */
 		if (q->p_stat == SSTOP && (q->p_flag & flag) == 0) {
-			if (q->p_wchan == NULL)
+			if (q->p_wchan == NULL) {
+				unsleep(q);
 				setrunnable(q);
-			else {
+			} else {
 				atomic_clearbits_int(&q->p_flag, P_WSLEEP);
 				q->p_stat = SSLEEP;
 			}
Index: kern/kern_synch.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.200
diff -u -p -r1.200 kern_synch.c
--- kern/kern_synch.c	13 Sep 2023 14:25:49 -0000	1.200
+++ kern/kern_synch.c	22 Feb 2024 14:55:05 -0000
@@ -467,25 +467,23 @@ sleep_signal_check(void)
 }
 
 int
-wakeup_proc(struct proc *p, const volatile void *chan, int flags)
+wakeup_proc(struct proc *p, int flags)
 {
 	int awakened = 0;
 
 	SCHED_ASSERT_LOCKED();
 
-	if (p->p_wchan != NULL &&
-	    ((chan == NULL) || (p->p_wchan == chan))) {
+	if (p->p_wchan != NULL) {
 		awakened = 1;
 		if (flags)
 			atomic_setbits_int(&p->p_flag, flags);
-		if (p->p_stat == SSLEEP)
-			setrunnable(p);
-		else if (p->p_stat == SSTOP)
-			unsleep(p);
 #ifdef DIAGNOSTIC
-		else
-			panic("wakeup: p_stat is %d", (int)p->p_stat);
+		if (p->p_stat != SSLEEP && p->p_stat != SSTOP)
+			panic("thread %d p_stat is %d", p->p_tid, p->p_stat);
 #endif
+		unsleep(p);
+		if (p->p_stat == SSLEEP)
+			setrunnable(p);
 	}
 
 	return awakened;
@@ -505,7 +503,7 @@ endtsleep(void *arg)
 	int s;
 
 	SCHED_LOCK(s);
-	wakeup_proc(p, NULL, P_TIMEOUT);
+	wakeup_proc(p, P_TIMEOUT);
 	SCHED_UNLOCK(s);
 }
 
@@ -531,21 +529,35 @@ unsleep(struct proc *p)
 void
 wakeup_n(const volatile void *ident, int n)
 {
-	struct slpque *qp;
+	struct slpque *qp, wakeq;
 	struct proc *p;
 	struct proc *pnext;
 	int s;
 
+	TAILQ_INIT(&wakeq);
+
 	SCHED_LOCK(s);
 	qp = &slpque[LOOKUP(ident)];
 	for (p = TAILQ_FIRST(qp); p != NULL && n != 0; p = pnext) {
 		pnext = TAILQ_NEXT(p, p_runq);
 #ifdef DIAGNOSTIC
 		if (p->p_stat != SSLEEP && p->p_stat != SSTOP)
-			panic("wakeup: p_stat is %d", (int)p->p_stat);
+			panic("thread %d p_stat is %d", p->p_tid, p->p_stat);
 #endif
-		if (wakeup_proc(p, ident, 0))
+		KASSERT(p->p_wchan != NULL);
+		if (p->p_wchan == ident) {
+			TAILQ_REMOVE(qp, p, p_runq);
+			p->p_wchan = NULL;
+			TAILQ_INSERT_TAIL(&wakeq, p, p_runq);
 			--n;
+		}
+	}
+	while ((p = TAILQ_FIRST(&wakeq))) {
+		TAILQ_REMOVE(&wakeq, p, p_runq);
+		TRACEPOINT(sched, unsleep, p->p_tid + THREAD_PID_OFFSET,
+		    p->p_p->ps_pid);
+		if (p->p_stat == SSLEEP)
+			setrunnable(p);
 	}
 	SCHED_UNLOCK(s);
 }
Index: kern/sched_bsd.c
===================================================================
RCS file: /cvs/src/sys/kern/sched_bsd.c,v
retrieving revision 1.90
diff -u -p -r1.90 sched_bsd.c
--- kern/sched_bsd.c	24 Jan 2024 19:23:38 -0000	1.90
+++ kern/sched_bsd.c	22 Feb 2024 14:24:31 -0000
@@ -500,12 +500,10 @@ setrunnable(struct proc *p)
 		if ((pr->ps_flags & PS_TRACED) != 0 && pr->ps_xsig != 0)
 			atomic_setbits_int(&p->p_siglist, sigmask(pr->ps_xsig));
 		prio = p->p_usrpri;
-		unsleep(p);
 		setrunqueue(NULL, p, prio);
 		break;
 	case SSLEEP:
 		prio = p->p_slppri;
-		unsleep(p);		/* e.g. when sending signals */
 
 		/* if not yet asleep, don't add to runqueue */
 		if (ISSET(p->p_flag, P_WSLEEP))
Index: kern/sys_process.c
===================================================================
RCS file: /cvs/src/sys/kern/sys_process.c,v
retrieving revision 1.95
diff -u -p -r1.95 sys_process.c
--- kern/sys_process.c	21 Nov 2023 14:00:13 -0000	1.95
+++ kern/sys_process.c	22 Feb 2024 14:25:34 -0000
@@ -493,6 +493,7 @@ ptrace_ctrl(struct proc *p, int req, pid
 		if (t->p_stat == SSTOP) {
 			tr->ps_xsig = data;
 			SCHED_LOCK(s);
+			unsleep(t);
 			setrunnable(t);
 			SCHED_UNLOCK(s);
 		} else {
Index: sys/proc.h
===================================================================
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.356
diff -u -p -r1.356 proc.h
--- sys/proc.h	3 Feb 2024 18:51:58 -0000	1.356
+++ sys/proc.h	22 Feb 2024 14:32:13 -0000
@@ -555,7 +555,7 @@ void procinit(void);
 void	setpriority(struct proc *, uint32_t, uint8_t);
 void	setrunnable(struct proc *);
 void	endtsleep(void *);
-int	wakeup_proc(struct proc *, const volatile void *, int);
+int	wakeup_proc(struct proc *, int);
 void	unsleep(struct proc *);
 void	reaper(void *);
 __dead void exit1(struct proc *, int, int, int);
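
As a side note, here is a small, self-contained userland sketch of the same
unlink-then-wake split the wakeup_n() hunk above applies. It is only an
illustration of the pattern, not kernel code: the names waiter, sleepq and
wakeup_two_phase are made up for the example. The shared queue is touched
only in the first loop; the second loop walks a private list, so a nested
wakeup fired from a tracepoint-like hook can no longer invalidate the
iterator of the first loop.

/*
 * Userland illustration of the two-phase wakeup pattern (made-up names).
 */
#include <sys/queue.h>
#include <stdio.h>

struct waiter {
	int			 id;
	const void		*wchan;		/* channel it sleeps on */
	TAILQ_ENTRY(waiter)	 entry;
};
TAILQ_HEAD(waitq, waiter);

static struct waitq sleepq = TAILQ_HEAD_INITIALIZER(sleepq);

static void
wakeup_two_phase(const void *ident)
{
	struct waitq wakeq = TAILQ_HEAD_INITIALIZER(wakeq);
	struct waiter *w, *next;

	/* Phase 1: the shared queue is only modified here. */
	for (w = TAILQ_FIRST(&sleepq); w != NULL; w = next) {
		next = TAILQ_NEXT(w, entry);
		if (w->wchan == ident) {
			TAILQ_REMOVE(&sleepq, w, entry);
			w->wchan = NULL;
			TAILQ_INSERT_TAIL(&wakeq, w, entry);
		}
	}

	/*
	 * Phase 2: walk the private list.  Side effects here (tracepoints,
	 * nested wakeups) cannot corrupt the traversal done in phase 1.
	 */
	while ((w = TAILQ_FIRST(&wakeq)) != NULL) {
		TAILQ_REMOVE(&wakeq, w, entry);
		printf("waking waiter %d\n", w->id);	/* stands in for setrunnable() */
	}
}

int
main(void)
{
	static int chan_a, chan_b;
	struct waiter w[3] = {
		{ .id = 1, .wchan = &chan_a },
		{ .id = 2, .wchan = &chan_b },
		{ .id = 3, .wchan = &chan_a },
	};
	int i;

	for (i = 0; i < 3; i++)
		TAILQ_INSERT_TAIL(&sleepq, &w[i], entry);

	wakeup_two_phase(&chan_a);	/* wakes waiters 1 and 3, leaves 2 asleep */
	return 0;
}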