On Sun, Mar 24, 2024 at 11:03:24AM +0100, Martin Pieuchot wrote: > On 22/02/24(Thu) 17:24, Claudio Jeker wrote: > > On Thu, Feb 22, 2024 at 04:16:57PM +0100, Martin Pieuchot wrote: > > > On 21/02/24(Wed) 13:05, Claudio Jeker wrote: > > > > On Tue, Feb 20, 2024 at 09:34:12PM +0100, Martin Pieuchot wrote: > > > > > On 28/10/21(Thu) 05:45, Visa Hankala wrote: > > > > > > On Wed, Oct 27, 2021 at 09:02:08PM -0400, Dave Voutila wrote: > > > > > > > Dave Voutila <d...@sisu.io> writes: > > > > > > > > > > > > > > > Was tinkering on a bt(5) script for trying to debug an issue in > > > > > > > > vmm(4) > > > > > > > > when I managed to start hitting a panic "wakeup: p_stat is 2" > > > > > > > > being > > > > > > > > triggered by kqueue coming from the softnet kernel task. > > > > > > > > > > > > > > > > I'm running an amd64 kernel built from the tree today (latest > > > > > > > > CVS commit > > > > > > > > id UynQo1r7kLKA0Q2p) with VMM_DEBUG option set and the defaults > > > > > > > > from > > > > > > > > GENERIC.MP. Userland is from the latest amd snap. > > > > > > > > > > > > > > > > To reproduce, I'm running a single OpenBSD-current guest under > > > > > > > > vmd(8) > > > > > > > > which I'm targeting with the following trivial btrace script I > > > > > > > > was > > > > > > > > working on to use for debugging something in vmm(4): > > > > > > > > > > > > > > > > tracepoint:sched:sleep / pid == $1 && tid == $2 /{ > > > > > > > > printf("pid %d, tid %d slept %d!\n", pid, tid, nsecs); > > > > > > > > } > > > > > > > > > > > > > > > > tracepoint:sched:wakeup / pid == $1 && tid == $2 /{ > > > > > > > > printf("pid %d, tid %d awoke %d!\n", pid, tid, nsecs); > > > > > > > > } > > > > > > > > > > > > > > Even easier reproduction: if you have 2 machines and can use > > > > > > > tcpbench(1) > > > > > > > between them, then while tcpbench is running target it with the > > > > > > > above > > > > > > > btrace script. I've found running the script, killing it with > > > > > > > ctrl-c, > > > > > > > and re-running it 2-3 times triggers the panic on my laptop. > > > > > > > > > > > > > > > > > > > > > > > Both times this happened I was trying to sysupgrade the vmd(8) > > > > > > > > guest > > > > > > > > while running the above btrace script. When I don't run the > > > > > > > > script, > > > > > > > > there is no panic. > > > > > > > > > > > > > > > > Image of the full backtrace is here: https://imgur.com/a/swW1qoj > > > > > > > > > > > > > > > > Simple transcript of the call stack after the panic() call > > > > > > > > looks like: > > > > > > > > > > > > > > > > wakeup_n > > > > > > > > kqueue_wakeup > > > > > > > > knote > > > > > > > > selwakekup > > > > > > > > tun_enqueue > > > > > > > > ether_output > > > > > > > > ip_output > > > > > > > > ip_forward > > > > > > > > ip_input_if > > > > > > > > ipv4_input > > > > > > > > ether_input > > > > > > > > if_input_process > > > > > > > > > > > > > > > > The other 3 cpu cores appeared to be in ipi handlers. (Image in > > > > > > > > that > > > > > > > > imgur link) > > > > > > > > > > > > I think the problem is recursion within the scheduler. Trace points > > > > > > invoke wakeup() directly, which is unsafe when done from within the > > > > > > run queue routines. > > > > > > > > > > > > One approach to avoid the recursion is to defer the wakeup() with > > > > > > a soft interrupt. The scheduler runs at an elevated system priority > > > > > > level that blocks soft interrupts. The deferred wakeup() is issued > > > > > > once > > > > > > the system priority level turns low enough. > > > > > > > > > > > > Unfortunately, the patch will get broken when someone adds trace > > > > > > points > > > > > > to soft interrupt scheduling... > > > > > > > > > > Thanks for the report. Sorry for the delay. I'm now really > > > > > interested > > > > > in fixing this bug because I'm heavily using btrace(8) to analyse the > > > > > scheduler and I hit this panic a couple of times. > > > > > > > > > > The problem is that `pnext' might no longer be on the sleepqueue > > > > > after a > > > > > tracepoint inside setrunnable() fired. Diff below fixes that by > > > > > making > > > > > wakeup_n() re-entrant. > > > > > > > > > > I'm not very interested in committing this diff because it relies on a > > > > > recursive SCHED_LOCK(). Instead I'd prefer to split wakeup_n() in two > > > > > stages: first unlink the threads then call setrunnable(). This > > > > > approach > > > > > will help us untangle the sleepqueue but needs a bit more shuffling, > > > > > like moving unsleep() out of setrunnable()... > > > > > > > > > > Claudio, Visa do you agree? > > > > > > > > [...] > > > > > > > > Ugh, this is too much magic for my little brain. > > > > I would prefer to unqueue the procs onto a new local list and have a > > > > special wakeup_proc for them. The code around wakeup and unsleep will > > > > need > > > > some changes to unlock it properly. > > > > > > Yes, that is what I had in mind. Diff below does that and works for > > > me(tm). > > > > > > The first loop removes threads from the sleep queue and the second put > > > them on a runqueue and optionally call tracepoints. > > > > > > Note that unsleep() has been removed from setrunnable() to reduce the > > > complexity of the code. Messing, sometimes, with the state of the > > > sleepqueue down in this path doesn't help me get a simpler picture of > > > this machinery. > > > > > > ok? > > > > I like this a lot. I think this is going in the right direction. > > Diff looks good but I only quickly looked at it. > > Untangling unsleep() from setrunnable() is a good thing (I hope it works). > > I'd like to move forward with this. Any ok?
One comment below. This is OK claudio@ but I think you should wait until we have snapshot builds again before putting this in. Or at least don't pile too many such diffs on top of each other before we have snaps going. > > > Index: dev/pci/drm/drm_linux.c > > > =================================================================== > > > RCS file: /cvs/src/sys/dev/pci/drm/drm_linux.c,v > > > retrieving revision 1.109 > > > diff -u -p -r1.109 drm_linux.c > > > --- dev/pci/drm/drm_linux.c 21 Jan 2024 13:36:40 -0000 1.109 > > > +++ dev/pci/drm/drm_linux.c 22 Feb 2024 14:32:27 -0000 > > > @@ -162,7 +162,7 @@ wake_up_process(struct proc *p) > > > int s, rv; > > > > > > SCHED_LOCK(s); > > > - rv = wakeup_proc(p, NULL, 0); > > > + rv = wakeup_proc(p, 0); > > > SCHED_UNLOCK(s); > > > return rv; > > > } > > > Index: kern/kern_sig.c > > > =================================================================== > > > RCS file: /cvs/src/sys/kern/kern_sig.c,v > > > retrieving revision 1.321 > > > diff -u -p -r1.321 kern_sig.c > > > --- kern/kern_sig.c 17 Jan 2024 22:22:25 -0000 1.321 > > > +++ kern/kern_sig.c 22 Feb 2024 14:25:16 -0000 > > > @@ -1200,6 +1200,7 @@ runfast: > > > if (p->p_usrpri > PUSER) > > > p->p_usrpri = PUSER; > > > run: > > > + unsleep(p); > > > setrunnable(p); > > > out: > > > /* finally adjust siglist */ > > > @@ -2104,6 +2105,7 @@ single_thread_set(struct proc *p, int fl > > > if (q->p_flag & P_WEXIT) { > > > if (mode == SINGLE_EXIT) { > > > if (q->p_stat == SSTOP) { > > > + unsleep(q); > > > setrunnable(q); > > > atomic_inc_int(&pr->ps_singlecount); > > > } > > > @@ -2125,12 +2127,14 @@ single_thread_set(struct proc *p, int fl > > > break; > > > } > > > /* need to unwind or exit, so wake it */ > > > + unsleep(q); > > > setrunnable(q); > > > } > > > atomic_inc_int(&pr->ps_singlecount); > > > break; > > > case SSTOP: > > > if (mode == SINGLE_EXIT) { > > > + unsleep(q); > > > setrunnable(q); > > > atomic_inc_int(&pr->ps_singlecount); > > > } > > > @@ -2198,9 +2202,10 @@ single_thread_clear(struct proc *p, int > > > * it back into some sleep queue > > > */ > > > if (q->p_stat == SSTOP && (q->p_flag & flag) == 0) { > > > - if (q->p_wchan == NULL) > > > + if (q->p_wchan == NULL) { > > > + unsleep(q); This call makes little sense to me (apart from a mechanical change) If q->p_wchan == NULL then unsleep(q) does nothing. So this call just needs a setrunnable() call. > > > setrunnable(q); > > > - else { > > > + } else { > > > atomic_clearbits_int(&q->p_flag, P_WSLEEP); > > > q->p_stat = SSLEEP; > > > } > > > Index: kern/kern_synch.c > > > =================================================================== > > > RCS file: /cvs/src/sys/kern/kern_synch.c,v > > > retrieving revision 1.200 > > > diff -u -p -r1.200 kern_synch.c > > > --- kern/kern_synch.c 13 Sep 2023 14:25:49 -0000 1.200 > > > +++ kern/kern_synch.c 22 Feb 2024 14:55:05 -0000 > > > @@ -467,25 +467,23 @@ sleep_signal_check(void) > > > } > > > > > > int > > > -wakeup_proc(struct proc *p, const volatile void *chan, int flags) > > > +wakeup_proc(struct proc *p, int flags) > > > { > > > int awakened = 0; > > > > > > SCHED_ASSERT_LOCKED(); > > > > > > - if (p->p_wchan != NULL && > > > - ((chan == NULL) || (p->p_wchan == chan))) { > > > + if (p->p_wchan != NULL) { > > > awakened = 1; > > > if (flags) > > > atomic_setbits_int(&p->p_flag, flags); > > > - if (p->p_stat == SSLEEP) > > > - setrunnable(p); > > > - else if (p->p_stat == SSTOP) > > > - unsleep(p); > > > #ifdef DIAGNOSTIC > > > - else > > > - panic("wakeup: p_stat is %d", (int)p->p_stat); > > > + if (p->p_stat != SSLEEP && p->p_stat != SSTOP) > > > + panic("thread %d p_stat is %d", p->p_tid, p->p_stat); > > > #endif > > > + unsleep(p); > > > + if (p->p_stat == SSLEEP) > > > + setrunnable(p); > > > } > > > > > > return awakened; > > > @@ -505,7 +503,7 @@ endtsleep(void *arg) > > > int s; > > > > > > SCHED_LOCK(s); > > > - wakeup_proc(p, NULL, P_TIMEOUT); > > > + wakeup_proc(p, P_TIMEOUT); > > > SCHED_UNLOCK(s); > > > } > > > > > > @@ -531,21 +529,35 @@ unsleep(struct proc *p) > > > void > > > wakeup_n(const volatile void *ident, int n) > > > { > > > - struct slpque *qp; > > > + struct slpque *qp, wakeq; > > > struct proc *p; > > > struct proc *pnext; > > > int s; > > > > > > + TAILQ_INIT(&wakeq); > > > + > > > SCHED_LOCK(s); > > > qp = &slpque[LOOKUP(ident)]; > > > for (p = TAILQ_FIRST(qp); p != NULL && n != 0; p = pnext) { > > > pnext = TAILQ_NEXT(p, p_runq); > > > #ifdef DIAGNOSTIC > > > if (p->p_stat != SSLEEP && p->p_stat != SSTOP) > > > - panic("wakeup: p_stat is %d", (int)p->p_stat); > > > + panic("thread %d p_stat is %d", p->p_tid, p->p_stat); > > > #endif > > > - if (wakeup_proc(p, ident, 0)) > > > + KASSERT(p->p_wchan != NULL); > > > + if (p->p_wchan == ident) { > > > + TAILQ_REMOVE(qp, p, p_runq); > > > + p->p_wchan = NULL; > > > + TAILQ_INSERT_TAIL(&wakeq, p, p_runq); > > > --n; > > > + } > > > + } > > > + while ((p = TAILQ_FIRST(&wakeq))) { > > > + TAILQ_REMOVE(&wakeq, p, p_runq); > > > + TRACEPOINT(sched, unsleep, p->p_tid + THREAD_PID_OFFSET, > > > + p->p_p->ps_pid); > > > + if (p->p_stat == SSLEEP) > > > + setrunnable(p); > > > } > > > SCHED_UNLOCK(s); > > > } > > > Index: kern/sched_bsd.c > > > =================================================================== > > > RCS file: /cvs/src/sys/kern/sched_bsd.c,v > > > retrieving revision 1.90 > > > diff -u -p -r1.90 sched_bsd.c > > > --- kern/sched_bsd.c 24 Jan 2024 19:23:38 -0000 1.90 > > > +++ kern/sched_bsd.c 22 Feb 2024 14:24:31 -0000 > > > @@ -500,12 +500,10 @@ setrunnable(struct proc *p) > > > if ((pr->ps_flags & PS_TRACED) != 0 && pr->ps_xsig != 0) > > > atomic_setbits_int(&p->p_siglist, sigmask(pr->ps_xsig)); > > > prio = p->p_usrpri; > > > - unsleep(p); > > > setrunqueue(NULL, p, prio); > > > break; > > > case SSLEEP: > > > prio = p->p_slppri; > > > - unsleep(p); /* e.g. when sending signals */ > > > > > > /* if not yet asleep, don't add to runqueue */ > > > if (ISSET(p->p_flag, P_WSLEEP)) > > > Index: kern/sys_process.c > > > =================================================================== > > > RCS file: /cvs/src/sys/kern/sys_process.c,v > > > retrieving revision 1.95 > > > diff -u -p -r1.95 sys_process.c > > > --- kern/sys_process.c 21 Nov 2023 14:00:13 -0000 1.95 > > > +++ kern/sys_process.c 22 Feb 2024 14:25:34 -0000 > > > @@ -493,6 +493,7 @@ ptrace_ctrl(struct proc *p, int req, pid > > > if (t->p_stat == SSTOP) { > > > tr->ps_xsig = data; > > > SCHED_LOCK(s); > > > + unsleep(t); > > > setrunnable(t); > > > SCHED_UNLOCK(s); > > > } else { > > > Index: sys/proc.h > > > =================================================================== > > > RCS file: /cvs/src/sys/sys/proc.h,v > > > retrieving revision 1.356 > > > diff -u -p -r1.356 proc.h > > > --- sys/proc.h 3 Feb 2024 18:51:58 -0000 1.356 > > > +++ sys/proc.h 22 Feb 2024 14:32:13 -0000 > > > @@ -555,7 +555,7 @@ void procinit(void); > > > void setpriority(struct proc *, uint32_t, uint8_t); > > > void setrunnable(struct proc *); > > > void endtsleep(void *); > > > -int wakeup_proc(struct proc *, const volatile void *, int); > > > +int wakeup_proc(struct proc *, int); > > > void unsleep(struct proc *); > > > void reaper(void *); > > > __dead void exit1(struct proc *, int, int, int); > > > > -- > > :wq Claudio > > -- :wq Claudio