On Tue, May 04, 2021 at 11:54:31AM -0500, Scott Cheloha wrote:
> On Tue, May 04, 2021 at 11:21:27PM +1000, David Gwynne wrote:
> > On Tue, May 04, 2021 at 11:24:05AM +0200, Martin Pieuchot wrote:
> > > On 04/05/21(Tue) 01:10, Scott Cheloha wrote:
> > > > [...] 
> > > > I want to run softclock() without the kernel lock.  The way to go, I
> > > > think, is to first push the kernel lock down into timeout_run(), and
> > > > then to remove the kernel lock from each timeout, one by one.
> > > 
> > > Grabbing and releasing the KERNEL_LOCk() on a per-timeout basis creates
> > > more latency than running all timeouts in a batch after having grabbed
> > > the KERNEL_LOCK().  I doubt this is the best way forward.
> > > 
> > > > Before we can push the kernel lock down into timeout_run() we need to
> > > > remove the kernel lock from timeout_del_barrier(9).
> > > 
> > > Seems worth it on its own.
> > >
> > > > The kernel lock is used in timeout_del_barrier(9) to determine whether
> > > > the given timeout has stopped running.  Because the softclock() runs
> > > > with the kernel lock we currently assume that once the calling thread
> > > > has taken the kernel lock any ongoing softclock() must have returned
> > > > and relinquished the lock, so the timeout in question has returned.
> > 
> > as i'll try to explain below, it's not about waiting for a specific
> > timeout, it's about knowing that a currently running timeout has
> > finished.
> 
> I'm very confused about the distinction between the two.

It is subtle :(

> > > So you want to stop using the KERNEL_LOCK() to do the serialization?  
> > 
> > i used the KERNEL_LOCK in timeout_barrier cos it was an
> > elegant^Wclever^W hack. because timeouts run under the kernel lock, i
> > knew i could take the lock and know that timeouts aren't running
> > anymore. that's all.
> > 
> > in hindsight this means that the thread calling timeout_barrier spins
> > when it could sleep, and worse, it spins waiting for all pending
> > timeouts to run. so yeah, i agree that undoing the hack is worth it on
> > its own.
> 
> Okay so we're all in agreement that this could be improved, cool.

Yep.

> > > > The simplest replacement I can think of is a volatile pointer to the
> > > > running timeout that we set before leaving the timeout_mutex and clear
> > > > after reentering it in timeout_run().
> > > 
> > > Sounds like a condition variable protected by this mutex.  Interesting
> > > that cond_wait(9) doesn't work with a mutex. 
> > 
> > cond_wait borrows the sched lock to coordinate between the waiting
> > thread and the signalling context. the cond api is basically a wrapper
> > around sleep_setup/sleep_finish.
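
To illustrate, the cond API is just a rendezvous between two contexts.
A minimal sketch (the wmesg and the way &c gets handed over are made
up; use whatever fits):

	/* waiting side */
	struct cond c = COND_INITIALIZER();
	/* hand &c to the other context, eg as a timeout/task argument */
	cond_wait(&c, "example");	/* sleeps until signalled */

	/* signalling side, running in the other context */
	cond_signal(&c);		/* wakes the waiter */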
> > 
> > > > So in the non-TIMEOUT_PROC case the timeout_del_barrier(9) caller just
> > > > spins until the timeout function returns and the timeout_running
> > > > pointer is changed.  Not every caller can sleep during
> > > > timeout_del_barrier(9).  I think spinning is the simplest thing that
> > > > will definitely work here.
> > > 
> > > This keeps the current semantic indeed.
> > 
> > i don't want to throw timeout_barrier out just yet.
> 
> I'm fine with that... where might it be useful?

A single timeout_barrier call can be used as a barrier for multiple
timeouts.  This would feel more natural if timeout_barrier took a flag
instead of a timeout pointer, or if we didn't have the timeout_proc
context.
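
For example, a sketch with made-up sc_tmo_* names: after deleting
several timeouts that all run in the same (softint) context, one
barrier covers them all:

	timeout_del(&sc->sc_tmo_a);
	timeout_del(&sc->sc_tmo_b);
	timeout_del(&sc->sc_tmo_c);
	/*
	 * only one timeout runs at a time per context, so one
	 * barrier is enough to know none of them is still running.
	 */
	timeout_barrier(&sc->sc_tmo_a);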

> > > > -void
> > > > -timeout_barrier(struct timeout *to)
> > > > +int
> > > > +timeout_del_barrier(struct timeout *to)
> > > >  {
> > > > +       struct timeout barrier;
> > > > +       struct cond c = COND_INITIALIZER();
> > > >         int needsproc = ISSET(to->to_flags, TIMEOUT_PROC);
> > > >  
> > > >         timeout_sync_order(needsproc);
> > > >  
> > > > -       if (!needsproc) {
> > > > -               KERNEL_LOCK();
> > > > -               splx(splsoftclock());
> > > > -               KERNEL_UNLOCK();
> > > > -       } else {
> > > > -               struct cond c = COND_INITIALIZER();
> > > > -               struct timeout barrier;
> > > > +       mtx_enter(&timeout_mutex);
> > > > +
> > > > +       if (timeout_del_locked(to)) {
> > > > +               mtx_leave(&timeout_mutex);
> > > > +               return 1;
> > > > +       }
> > > >  
> > > > +       if (needsproc) {
> > > >                 timeout_set_proc(&barrier, timeout_proc_barrier, &c);
> > > >                 barrier.to_process = curproc->p_p;
> > > > -
> > > > -               mtx_enter(&timeout_mutex);
> > > >                 SET(barrier.to_flags, TIMEOUT_ONQUEUE);
> > > >                 CIRCQ_INSERT_TAIL(&timeout_proc, &barrier.to_list);
> > > >                 mtx_leave(&timeout_mutex);
> > > > -
> > > >                 wakeup_one(&timeout_proc);
> > > > -
> > > >                 cond_wait(&c, "tmobar");
> > > > +       } else {
> > > > +               mtx_leave(&timeout_mutex);
> > > > +               /* XXX Is this in the right spot? */
> > > > +               splx(splsoftclock());
> > > > +               while (timeout_running == to)
> > > > +                       CPU_BUSY_CYCLE();
> > > 
> > > Won't splx() execute the soft interrupt if any is pending?
> > > Shouldn't the barrier be before?  Could you add `spc->spc_spinning++'
> > > around the spinning loop?  What happens if two threads call
> > > timeout_del_barrier(9) with the same argument at the same time?  Is
> > > that possible and/or supported?
> > 
> > the timeout passed to timeout_barrier is only used to figure out which
> > context the barrier should apply to, ie, it's used to pick between the
> > softint or proc queue runners. like the other barriers in other parts of
> > the kernel, it's just supposed to wait for the current work to finish
> > running.
> 
> Here is where I get confused.
> 
> Why do I want to wait for *all* timeouts in the queue to finish
> running?

You don't, you just want to know that whatever timeout was running
has finished cos you don't know if that currently running timeout is
yours or not.

> My understanding of the timeout_del_barrier(9) use case was as
> follows:
> 
> 1. I have a dynamically allocated timeout struct.  The timeout is
>    scheduled to execute at some point in the future.
> 
> 2. I want to free the memory allocated for the timeout.
> 
> 3. To safely free the memory I need to ensure the timeout
>    is not executing.  Until the timeout function, i.e.
>    to->to_func(), has returned it isn't necessarily safe to
>    free that memory.
> 
> 4. If I call timeout_del_barrier(9), it will only return once the
>    timeout in question is no longer running.  Assuming the timeout
>    itself is not a periodic timeout that reschedules itself, we can
>    then safely free the memory.

Barriers aren't about references to timeouts, they're about references
to the work associated with a timeout.

If you only care about the timeout struct itself, then you can get
all the information about whether it's referenced by the timeout
queues from the return values of timeout_add() and timeout_del().
timeout_add() returns 1 if the timeout subsystem takes a reference
to the timeout; if the timeout is already scheduled it returns 0
because the subsystem already holds a reference.  timeout_del()
returns 1 if the timeout was scheduled and it removed it, thereby
giving up its reference.  If timeout_del() returns 0, then the
timeout wasn't on a queue inside the timeout subsystem.  Easy.
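
For instance, with a made-up sc_tmo:

	if (timeout_add_sec(&sc->sc_tmo, 1) == 0) {
		/* already scheduled; the queue keeps its reference */
	}

	if (timeout_del(&sc->sc_tmo)) {
		/* it was queued; the queue's reference comes back to us */
	} else {
		/* not queued: never added, already ran, or running now */
	}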

What timeout_add() and timeout_del() don't tell you is whether the
work referenced by the timeout is currently running.  The timeout
runners copy the function and argument onto the stack when they
dequeue a timeout to run, and give up the reference to the timeout
when the mutex around the timeout wheels/circqs is released.  However,
the argument is still referenced on the stack, and the function that
processes it may be about to run or already running.  You can't tell
that from the timeout_add/del return values, though.
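
Roughly what a runner does (a simplified sketch of timeout_run() in
kern_timeout.c):

	mtx_enter(&timeout_mutex);
	CLR(to->to_flags, TIMEOUT_ONQUEUE);
	fn = to->to_func;		/* copied onto the runner's stack */
	arg = to->to_arg;
	mtx_leave(&timeout_mutex);	/* reference to "to" is given up here */
	fn(arg);			/* ...but arg is still live here */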

We provide two ways to deal with that.  One is to have reference
counts (or similar) on the thing you're deferring to a timeout.
The other is to use a barrier, so you know the work you deferred
isn't on the timeout runner's stack anymore because the runner has
moved on to the barrier work.

This is consistent with tasks and interrupts.
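
So the classic teardown of a dynamically allocated timeout looks
something like this (a sketch; softc and sc_tmo are made up):

	void
	softc_detach(struct softc *sc)
	{
		/*
		 * timeout_del() alone only takes the timeout off the
		 * queue; the handler may still be live on a runner's
		 * stack.  The barrier waits for that before we free.
		 */
		timeout_del_barrier(&sc->sc_tmo);
		free(sc, M_DEVBUF, sizeof(*sc));
	}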

> Given this understanding, my approach was to focus on the timeout in
> question.  So my code just spins until the timeout is no longer
> running.
> 
> But now I think I am mistaken.
> 
> IIUC you're telling me (and showing me, in code) that the goal of
> timeout_barrier() is to wait for the *whole* softclock() to return,
> not just the one timeout.

No, just the currently running timeout.

I put the barrier timeout on the end of the proc and todo circqs
only because there's no CIRCQ_INSERT_HEAD I can use right now.

> Why do I want to wait for the whole softclock() or softclock_thread()?
> Why not just wait for the one timeout to finish?

Cos I was too lazy to write CIRCQ_INSERT_HEAD last night :D
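
For reference, mirroring the existing CIRCQ_INSERT_TAIL in
kern_timeout.c (modulo its stats bookkeeping), it would be something
like this (untested):

	#define CIRCQ_INSERT_HEAD(list, elem) do {		\
		(elem)->next = (list)->next;			\
		(elem)->prev = (list);				\
		(list)->next->prev = (elem);			\
		(list)->next = (elem);				\
	} while (0)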

> > it is unfortunate that the timeout_barrier man page isn't very
> > clear. it's worth reading the manpage for intr_barrier and
> > taskq_barrier. sched_barrier is a thing that exists too, but isn't
> > documented. also have a look at the EXAMPLE in the cond_wait(9)
> > manpage. however, don't read the taskq_barrier code. timeout_barrier
> > is like intr_barrier in that it uses the argument to find out where
> > work runs, but again, it doesn't care about the specific handler
> > itself. we could have a timeout_barrier(void) and a
> > timeout_barrier_proc(void) instead.
> 
> Ack.
> 
> > the point im trying to make here is that timeout_barrier isn't a
> > snowflake that exists in a vacuum, and should have similar semantics to
> > barriers in the other places in the kernel.
> 
> Also ack.
> 
> > something like this should work. i haven't kicked it around much
> > though, so maybe it doesn't. it would be nice if there was a
> > CIRCQ_INSERT_HEAD to use, but that would be a separate change.

This discussion has some relevance to taskqs too.  I was also lazy
when I implemented taskq_barrier and used task_add to put the barrier
onto the list of work, which meant it got added to the end.  Now DRM
relies on this behaviour.  Maybe I should have pushed harder to commit
the diff below.

This diff lets taskq barriers run as soon as possible, but adds a
compatibility shim to the drm workqueue code so it can still wait for
all pending work to finish, as it expects.

P.S. don't call timeout_del_barrier() from inside a timeout handler.
I still think timeout_set_proc was a mistake.

Index: kern/kern_task.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_task.c,v
retrieving revision 1.30
diff -u -p -r1.30 kern_task.c
--- kern/kern_task.c    11 Jun 2020 06:06:55 -0000      1.30
+++ kern/kern_task.c    11 Jun 2020 12:52:20 -0000
@@ -288,7 +288,7 @@ taskq_do_barrier(struct taskq *tq)
        while (tq->tq_bthreads < tq->tq_nthreads) {
                /* shove the task into the queue for a worker to pick up */
                SET(t.t_flags, TASK_ONQUEUE);
-               TAILQ_INSERT_TAIL(&tq->tq_worklist, &t, t_entry);
+               TAILQ_INSERT_HEAD(&tq->tq_worklist, &t, t_entry);
                wakeup_one(tq);
 
                msleep_nsec(&tq->tq_bthreads, &tq->tq_mtx,
Index: dev/pci/drm/drm_linux.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/drm/drm_linux.c,v
retrieving revision 1.59
diff -u -p -r1.59 drm_linux.c
--- dev/pci/drm/drm_linux.c     8 Jun 2020 04:47:58 -0000       1.59
+++ dev/pci/drm/drm_linux.c     11 Jun 2020 12:52:20 -0000
@@ -147,13 +147,32 @@ wake_up_process(struct proc *p)
        return wakeup_proc(p, NULL);
 }
 
+struct flush_workqueue_state {
+       struct taskq    *tq;
+       struct cond      cv;
+};
+
+static void
+flush_workqueue_task(void *p)
+{
+       struct flush_workqueue_state *s = p;
+
+       taskq_barrier(s->tq);
+       cond_signal(&s->cv);
+}
+
 void
 flush_workqueue(struct workqueue_struct *wq)
 {
+       struct taskq *tq = (struct taskq *)wq;
+       struct flush_workqueue_state s = { tq, COND_INITIALIZER() };
+       struct task t = TASK_INITIALIZER(flush_workqueue_task, &s);
+               
        if (cold)
                return;
 
-       taskq_barrier((struct taskq *)wq);
+       task_add(tq, &t);
+       cond_wait(&s.cv, "flushwq");
 }
 
 bool
