On Thu, Jul 25, 2024 at 08:20:32PM +0200, Martin Pieuchot wrote:
> On 25/07/24(Thu) 17:33, Claudio Jeker wrote:
> > On Thu, Jul 25, 2024 at 05:15:32PM +0200, Martin Pieuchot wrote:
> > > On 25/07/24(Thu) 14:51, Claudio Jeker wrote:
> > > > On Thu, Jul 25, 2024 at 11:09:44AM +0200, Martin Pieuchot wrote:
> > > > [...] 
> > > > > > Index: kern/kern_synch.c
> > > > > > ===================================================================
> > > > > > RCS file: /cvs/src/sys/kern/kern_synch.c,v
> > > > > > diff -u -p -r1.206 kern_synch.c
> > > > > > --- kern/kern_synch.c       23 Jul 2024 08:38:02 -0000      1.206
> > > > > > +++ kern/kern_synch.c       24 Jul 2024 14:14:06 -0000
> > > > > > @@ -399,15 +399,18 @@ sleep_finish(int timo, int do_sleep)
> > > > > >      */
> > > > > >     if (p->p_wchan == NULL)
> > > > > >             do_sleep = 0;
> > > > > > +   KASSERT((p->p_flag & P_SINGLESLEEP) == 0);
> > > > > >     atomic_clearbits_int(&p->p_flag, P_WSLEEP);
> > > > > >  
> > > > > > +   /* If requested to stop always force a stop even if do_sleep == 
> > > > > > 0 */
> > > > > > +   if (p->p_stat == SSTOP)
> > > > > > +           do_sleep = 1;
> > > > > 
> > > > > This is also scary.  The problem with the current scheme is that we 
> > > > > don't
> > > > > know who changed `p_stat' and if we already did our context switch or
> > > > > not. 
> > > > 
> > > > It is scary indeed. I would prefer if sleep_signal_check() would not
> > > > randomly mi_switch() away behind our back. I first tried that but it is
> > > > harder then you think.
> > > 
> > > So maybe let's start by adding:
> > > 
> > >   KASSERT(!(p->p_stat == SSTOP && do_sleep == 0))
> > > 
> > > And see if something blows.
> > 
> > I can assure you single_thread_set() is able to put p->p_stat to SSTOP
> > between these lines:
> >                 if ((error = sleep_signal_check(p)) != 0) {
> >                         catch = 0;
> >                         do_sleep = 0;
> >                 }
> >         }
> > 
> >         SCHED_LOCK();
> > 
> > So it can happen but the window is reasonably small (mainly the call to
> > cursig() and some minimal other fluff) that the KASSERT will probably
> > never hit.
> 
> This whole discussion makes me believe that SINGLE_SUSPEND is the
> incorrect solution for this and should die.
> 
> Instead of trying to change the state of siblings in single_thread_set()
> and context switching in single_thread_check() all threads should stop
> inside cursig().
> 
> I really appreciate all the efforts you've put into debugging this.
> However I cannot believe that adding more hacks and checking for p_stat
> is the way to go.

It is not a hack. There will always be something like this unless you want
to revert back to a giant recursive scheduler lock that can block the full
sleep machinery from start to finish and accept the fact that OpenBSD will
never scale to more than 8CPUs.

You have to accept that putting a thread to sleep is dirty business and
that sleep finish must check that the transaction was actually successful.

I tried various approaches to fix this problem. All of them where
horrible. I spent a week on this bug and I would prefer to spend my time
on actually working on fixing SIGSTOP, dowait6() and finally replace the
sched_lock in sleep. All of this will happen in and around the same place
and maybe once we're done my "hack" is gone as well.

I will cycle back into this mess at some point but I will not rewrite the
ptrace interaction now. I'm happy to drop this and just accept the fact
that debugging multithreaded applications on OpenBSD is simply broken.

-- 
:wq Claudio

Reply via email to