On 3/16/26 10:49, Andrea Righi wrote:
> Hi Christian,
> 
> On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
>> SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using
>> smp_cond_load_acquire() until the target CPU's current SCX task has been
>> context-switched out (its kick_sync counter advanced).
>>
>> If multiple CPUs each issue SCX_KICK_WAIT targeting one another
>> concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for
>> CPU A — all CPUs can end up wedged inside smp_cond_load_acquire()
>> simultaneously.  Because each victim CPU is spinning in hardirq/irq_work
>> context, it cannot reschedule, so no kick_sync counter ever advances and
>> the system deadlocks.
>>
>> Fix this by serializing access to the wait loop behind a global raw
>> spinlock (scx_kick_wait_lock).  Only one CPU at a time may execute the
>> wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to
>> acquire the lock records itself in scx_kick_wait_pending and returns.
>> When the active waiter finishes and releases the lock, it replays the
>> pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring
>> no wait request is silently dropped.
>>
>> This is deliberately a coarse serialization: multiple simultaneous wait
>> operations now run sequentially, increasing latency.  In exchange,
>> deadlocks are impossible regardless of the cycle length (A->B->C->...->A).
>>
>> Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale
>> bits left by a CPU that deferred just as the scheduler exited are reset
>> before the next scheduler instance loads.
>>
>> Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT")
>> Signed-off-by: Christian Loehle <[email protected]>
>> ---
>>  kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 43 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 26a6ac2f8826..b63ae13d0486 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -89,6 +89,19 @@ struct scx_kick_syncs {
>>  
>>  static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs);
>>  
>> +/*
>> + * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles.
>> + * Callers failing to acquire @scx_kick_wait_lock defer by recording
>> + * themselves in @scx_kick_wait_pending and are retriggered when the active
>> + * waiter completes.
>> + *
>> + * Lock ordering: @scx_kick_wait_lock is always acquired before
>> + * @scx_kick_wait_pending_lock; the two are never taken in the opposite 
>> order.
>> + */
>> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock);
>> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock);
>> +static cpumask_t scx_kick_wait_pending;
>> +
>>  /*
>>   * Direct dispatch marker.
>>   *
>> @@ -4279,6 +4292,13 @@ static void free_kick_syncs(void)
>>              if (to_free)
>>                      kvfree_rcu(to_free, rcu);
>>      }
>> +
>> +    /*
>> +     * Clear any CPUs that were waiting for the lock when the scheduler
>> +     * exited.  Their irq_work has already returned so no in-flight
>> +     * waiter can observe the stale bits on the next enable.
>> +     */
>> +    cpumask_clear(&scx_kick_wait_pending);
> 
> Do we need a raw_spin_lock/unlock(&scx_kick_wait_pending_lock) here to make
> sure we're not racing with with cpumask_set_cpu()/cpumask_clear_cpu()?
> Probably it's not that relevant at this point, but I'd keep the locking for
> correctness.

Of course, thanks. Noted for v2!

Are you fine with the approach, i.e. hitting it with the sledge hammer of global
serialization?
I have something more complex in mind too, but yeah, we'd need to at least 
either
let scx_bpf_kick_cpu() fail / -ERETRY or restrict kicking/kicked CPUs and 
introduce
a whole lot of infra, which seems a bit overkill for a apparently barely used
interface and also would be nasty to backport.

Reply via email to