Hello Dana,

On 20/06/24(Thu) 17:16, Dana Koch wrote:
> On Thu, Jun 20, 2024 at 3:33 PM Martin Pieuchot <m...@openbsd.org> wrote:
> >
> > Hello Dana,
> >
> > Thanks again for your report.
> >
> > On 19/06/24(Wed) 09:37, Dana Koch wrote:
> > > On Wed, Jun 19, 2024 at 6:58 AM Martin Pieuchot <m...@openbsd.org> wrote:
> > > > This is a lock order reversal reported by WITNESS.  Thankfully claudio@
> > > > already committed a fix for this on the 16th.  So please, try with
> > > > up-to-date sources
> > >
> > > Just to be paranoid, I built a kernel with recent sources and
> > > MP_LOCKDEBUG and WITNESS. I experienced both the "lock spun out" error
> > > after "starting network" -- but not on serial console, unfortunately
> > > -- and from `make -j24` as mentioned which I did capture.
> >
> > The problem is exposed by the many threads of lld(1).  While "starting
> > network" the boot process relinks a kernel.  More details below.
> >
> > Since when do you experience this issue?
> 
> Since I got the device earlier in June and put 7.5-current on it.
> 
> I have been trying fresh kernels each time; the most recent ones
> haven't been tripping at boot as frequently (perhaps the lock order
> reversal fix has solved part but not all of the underlying problem).
> 
> > The issue is related to the SCHED_LOCK(). Could you please use "ps /o"
> > in ddb next time?  This will help me figure out which CPU trace
> > corresponds to the process holding the KERNEL_LOCK().
> 
> Done. Here is a repro from today with `ps /o` output. (Perhaps worth
> noting the "lock spun out" message happening during the ddb session
> after `mach ddbcpu 9`, too.)

Could you try the diff below?  Stuart confirmed it prevents the hang on
his machine. 

Index: kern/kern_synch.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_synch.c,v
diff -u -p -r1.205 kern_synch.c
--- kern/kern_synch.c   3 Jun 2024 12:48:25 -0000       1.205
+++ kern/kern_synch.c   22 Jun 2024 12:57:37 -0000
@@ -576,25 +576,8 @@ wakeup(const volatile void *chan)
 int
 sys_sched_yield(struct proc *p, void *v, register_t *retval)
 {
-       struct proc *q;
-       uint8_t newprio;
-
-       /*
-        * If one of the threads of a multi-threaded process called
-        * sched_yield(2), drop its priority to ensure its siblings
-        * can make some progress.
-        */
-       mtx_enter(&p->p_p->ps_mtx);
-       newprio = p->p_usrpri;
-       TAILQ_FOREACH(q, &p->p_p->ps_threads, p_thr_link)
-               newprio = max(newprio, q->p_runpri);
-       mtx_leave(&p->p_p->ps_mtx);
-
-       SCHED_LOCK();
-       setrunqueue(p->p_cpu, p, newprio);
-       p->p_ru.ru_nvcsw++;
-       mi_switch();
-       SCHED_UNLOCK();
+       /* Force a sleep cycle to prevent contending on the SCHED_LOCK(). */
+       tsleep(&nowake, PUSER, "yield", 1);
 
        return (0);
 }
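
To exercise this code path from userland, a small test program along the
lines below may be useful.  It is only a sketch.  It assumes the
contention comes from many threads of a single process calling
sched_yield(2) in a tight loop, which is a guess about how lld(1)
behaves and is not confirmed in this thread.  The names yield_storm(),
yielder() and MAX_THREADS are made up for the example.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_THREADS 64		/* hypothetical cap, not taken from the kernel */

struct storm_arg {
	int iters;		/* sched_yield(2) calls per thread */
	atomic_long *total;	/* shared counter of completed yields */
};

/* Thread body: call sched_yield(2) in a tight loop. */
static void *
yielder(void *v)
{
	struct storm_arg *a = v;
	int i;

	for (i = 0; i < a->iters; i++) {
		sched_yield();
		atomic_fetch_add(a->total, 1);
	}
	return NULL;
}

/*
 * Start up to MAX_THREADS threads that all yield concurrently and
 * return the total number of sched_yield(2) calls that completed.
 */
long
yield_storm(int nthreads, int iters)
{
	pthread_t tids[MAX_THREADS];
	atomic_long total = 0;
	struct storm_arg arg = { iters, &total };
	int i, started = 0;

	if (nthreads > MAX_THREADS)
		nthreads = MAX_THREADS;

	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&tids[started], NULL, yielder, &arg) != 0)
			break;
		started++;
	}
	/* Join only the threads that were actually created. */
	for (i = 0; i < started; i++)
		pthread_join(tids[i], NULL);

	return atomic_load(&total);
}
```

Calling yield_storm() from a main() with a thread count at or above the
number of CPUs, built with -pthread, gives a rough way to compare
behaviour before and after the diff.  With the diff applied each call
goes through tsleep() with a timeout of one tick, so the same run is
expected to take noticeably longer in wall-clock time.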
