On Mon, Mar 16, 2026 at 11:12:43AM +0000, Christian Loehle wrote:
> On 3/16/26 10:49, Andrea Righi wrote:
> > Hi Christian,
> > 
> > On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
> >> SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using
> >> smp_cond_load_acquire() until the target CPU's current SCX task has been
> >> context-switched out (its kick_sync counter advanced).
> >>
> >> If multiple CPUs each issue SCX_KICK_WAIT targeting one another
> >> concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for
> >> CPU A — all CPUs can end up wedged inside smp_cond_load_acquire()
> >> simultaneously.  Because each victim CPU is spinning in hardirq/irq_work
> >> context, it cannot reschedule, so no kick_sync counter ever advances and
> >> the system deadlocks.
> >>
> >> Fix this by serializing access to the wait loop behind a global raw
> >> spinlock (scx_kick_wait_lock).  Only one CPU at a time may execute the
> >> wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to
> >> acquire the lock records itself in scx_kick_wait_pending and returns.
> >> When the active waiter finishes and releases the lock, it replays the
> >> pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring
> >> no wait request is silently dropped.
> >>
> >> This is deliberately a coarse serialization: multiple simultaneous wait
> >> operations now run sequentially, increasing latency.  In exchange,
> >> deadlocks are impossible regardless of the cycle length (A->B->C->...->A).
> >>
> >> Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale
> >> bits left by a CPU that deferred just as the scheduler exited are reset
> >> before the next scheduler instance loads.
> >>
> >> Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT")
> >> Signed-off-by: Christian Loehle <[email protected]>
> >> ---
> >>  kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
> >>  1 file changed, 43 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> >> index 26a6ac2f8826..b63ae13d0486 100644
> >> --- a/kernel/sched/ext.c
> >> +++ b/kernel/sched/ext.c
> >> @@ -89,6 +89,19 @@ struct scx_kick_syncs {
> >>  
> >>  static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs);
> >>  
> >> +/*
> >> + * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles.
> >> + * Callers failing to acquire @scx_kick_wait_lock defer by recording
> >> + * themselves in @scx_kick_wait_pending and are retriggered when the
> >> + * active waiter completes.
> >> + *
> >> + * Lock ordering: @scx_kick_wait_lock is always acquired before
> >> + * @scx_kick_wait_pending_lock; the two are never taken in the opposite
> >> + * order.
> >> + */
> >> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock);
> >> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock);
> >> +static cpumask_t scx_kick_wait_pending;
> >> +
> >>  /*
> >>   * Direct dispatch marker.
> >>   *
> >> @@ -4279,6 +4292,13 @@ static void free_kick_syncs(void)
> >>            if (to_free)
> >>                    kvfree_rcu(to_free, rcu);
> >>    }
> >> +
> >> +  /*
> >> +   * Clear any CPUs that were waiting for the lock when the scheduler
> >> +   * exited.  Their irq_work has already returned so no in-flight
> >> +   * waiter can observe the stale bits on the next enable.
> >> +   */
> >> +  cpumask_clear(&scx_kick_wait_pending);
> > 
> > Do we need a raw_spin_lock/unlock(&scx_kick_wait_pending_lock) here to make
> > sure we're not racing with cpumask_set_cpu()/cpumask_clear_cpu()?
> > Probably it's not that relevant at this point, but I'd keep the locking for
> > correctness.
> 
> Of course, thanks. Noted for v2!
> 
> Are you fine with the approach, i.e. hitting it with the sledgehammer of
> global serialization?
> I have something more complex in mind too, but yeah, we'd need to at
> least either let scx_bpf_kick_cpu() fail / -ERETRY or restrict
> kicking/kicked CPUs and introduce a whole lot of infra, which seems a
> bit overkill for an apparently barely used interface and would also be
> nasty to backport.

Yes, the current approach looks reasonable to me; I think the potential
latency increase (assuming there's any noticeable increase) is totally
acceptable in order to fix the deadlock.

Thanks,
-Andrea