This response was AI-generated by bug-bot. The analysis may contain errors —
please verify independently.
## Bug Summary
This is an RCU stall and hung-task deadlock on v7.0-rc1, triggered by perf
trace teardown under perf interrupt-storm conditions. The perf subsystem's
tracepoint unregistration path now blocks on an SRCU grace period
(tracepoint_srcu), which in turn blocks on RCU grace-period completion,
creating a cascading stall when RCU progress is delayed by perf NMI interrupt
storms. Severity: system hang (multiple tasks blocked >143s, eventual complete
stall).
## Stack Trace Analysis
The bug involves three interacting blocked entities. Here are the decoded stack
traces:
**1. repro2 (pid 4086) - blocked in perf trace teardown (close()):**
```
__x64_sys_close
fput_close_sync
__fput
perf_release
perf_event_release_kernel
put_event
__free_event
perf_trace_destroy
perf_trace_event_unreg
[kernel/trace/trace_event_perf.c:154]
tracepoint_synchronize_unregister
[include/linux/tracepoint.h:116]
synchronize_srcu(&tracepoint_srcu)
__synchronize_srcu
wait_for_completion ← BLOCKED
```
**2. kworker/0:0 (pid 9) and kworker/0:1 (pid 11) - SRCU grace period workers:**
```
Workqueue: rcu_gp process_srcu
process_srcu [kernel/rcu/srcutree.c:1304]
srcu_advance_state [kernel/rcu/srcutree.c:1161]
try_check_zero [kernel/rcu/srcutree.c:1171]
srcu_readers_active_idx_check [kernel/rcu/srcutree.c:544]
synchronize_rcu() ← SRCU-fast path, line 569
synchronize_rcu_normal
wait_for_completion ← BLOCKED
```
**3. repro2 (pid 4093) - RCU stall source:**
```
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-1): P4093
task:repro2 state:R running task
(running in futex_wake syscall, interrupted by timer IRQ)
asm_sysvec_apic_timer_interrupt
irqentry_exit → preempt_schedule_irq → __schedule
finish_task_switch
```
The traces show process context for the hung tasks and interrupt context
(timer IRQ) for the RCU stall report. The kworkers are in D (uninterruptible
sleep) state, blocked in wait_for_completion() within the SRCU grace-period
state machine.
## Root Cause Analysis
This is a regression introduced by commit a46023d5616ed ("tracing: Guard
__DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast"), which switched
tracepoint read-side protection from preempt_disable()+RCU to SRCU-fast via
DEFINE_SRCU_FAST(tracepoint_srcu).
The root cause is a new coupling between SRCU grace period processing and RCU
grace period completion that did not exist before. The deadlock chain is:
1. The reproducer creates perf events using tracepoints, then closes them while
generating heavy perf interrupt load. The perf NMI interrupt storms ("perf:
interrupt took too long" messages escalating from 69ms to 336ms) consume most
CPU time, starving RCU quiescent state detection.
2. When the perf fd is closed, perf_trace_event_unreg()
(kernel/trace/trace_event_perf.c:154) calls tracepoint_synchronize_unregister()
(include/linux/tracepoint.h:116), which now calls
synchronize_srcu(&tracepoint_srcu) instead of synchronize_rcu().
3. The SRCU grace period for tracepoint_srcu is processed by process_srcu()
running in the rcu_gp workqueue. Because tracepoint_srcu is DEFINE_SRCU_FAST,
its srcu_reader_flavor includes SRCU_READ_FLAVOR_FAST, which is part of
SRCU_READ_FLAVOR_SLOWGP.
4. In srcu_readers_active_idx_check() (kernel/rcu/srcutree.c:544), when
SRCU_READ_FLAVOR_SLOWGP is detected, the function calls synchronize_rcu() (line
569) instead of smp_mb() (line 301 in non-fast path). This is the key design
tradeoff of SRCU-fast: faster readers (no smp_mb() on read side) at the cost of
slower grace periods (synchronize_rcu() on update side).
5. synchronize_rcu() → synchronize_rcu_normal() → wait_for_completion(),
waiting for an RCU grace period to complete. But the RCU grace period is
stalled because the perf interrupt storms are preventing CPUs from passing
through quiescent states quickly enough.
6. Since process_srcu is blocked waiting for synchronize_rcu(), the
tracepoint_srcu SRCU grace period cannot advance, so
synchronize_srcu(&tracepoint_srcu) in the perf teardown path also blocks
indefinitely.
The pre-existing condition (perf NMI storms causing RCU stalls) was previously
tolerable because the perf teardown path used synchronize_rcu() directly (via
the old tracepoint_synchronize_unregister()), which would eventually complete
once the RCU stall resolved. Now, with SRCU-fast, there is an additional layer
of indirection: perf teardown waits on SRCU, SRCU processing waits on RCU, and
both the SRCU workqueue threads and the perf teardown task are stuck.
## Affected Versions
This is a regression in v7.0-rc1. The bug was introduced by commit
a46023d5616ed ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with
SRCU-fast"), which was merged via the trace-v7.0 merge (3c6e577d5ae70). The
underlying SRCU-fast infrastructure was added by commit c4020620528e4 ("srcu:
Add SRCU-fast readers") and 4d86b1e7e1e98 ("srcu: Add SRCU_READ_FLAVOR_SLOWGP
to flag need for synchronize_rcu()"), but the regression became triggerable
only when a46023d5616ed applied SRCU-fast to the tracepoint_srcu used in the
perf event teardown path.
Kernels before v7.0-rc1 (i.e., v6.x and earlier) are not affected, as they used
preempt_disable()+RCU for tracepoint protection, and
tracepoint_synchronize_unregister() called synchronize_rcu() directly without
SRCU involvement.
## Relevant Commits and Fixes
Key commits in the causal chain:
- a46023d5616ed ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL()
with SRCU-fast") - the commit that introduced the regression by switching
tracepoints to SRCU-fast
- a77cb6a867667 ("srcu: Fix warning to permit SRCU-fast readers in NMI
handlers") - immediate predecessor fix
- c4020620528e4 ("srcu: Add SRCU-fast readers") - added the SRCU-fast reader API
- 4d86b1e7e1e98 ("srcu: Add SRCU_READ_FLAVOR_SLOWGP to flag need for
synchronize_rcu()") - added the synchronize_rcu()-instead-of-smp_mb() logic in
SRCU grace period processing
- 16718274ee75d ("tracing: perf: Have perf tracepoint callbacks always disable
preemption") - preparatory commit for the SRCU-fast switch
No fix for this specific issue was found in mainline or in any -next branches
as of today.
## Prior Discussions
No prior reports of this specific RCU stall / SRCU deadlock triggered via perf
trace teardown with SRCU-fast were found on lore.kernel.org. The original
SRCU-fast tracepoint series was posted at
https://lore.kernel.org/all/[email protected]/ (linked from
the commit message), motivated by enabling preemptible BPF on tracepoints for
RT systems
(https://lore.kernel.org/all/[email protected]/).
No discussion of the synchronize_rcu()-from-workqueue stall scenario appears to
have taken place in those threads.
## Suggested Actions
1. Confirm the regression by testing with the parent commit a77cb6a867667
(immediately before a46023d5616ed). If the issue disappears, this confirms the
SRCU-fast tracepoint switch as the cause.
2. As a quick workaround, reverting a46023d5616ed (and its preparatory commits
a77cb6a867667, f7d327654b886, 16718274ee75d if needed) should eliminate the
deadlock, at the cost of losing preemptible BPF tracepoint support.
3. The fundamental issue is that process_srcu() for SRCU-fast structures calls
synchronize_rcu() synchronously from workqueue context. Possible fixes include:
- Using an asynchronous mechanism (e.g., call_rcu() with a callback to
resume SRCU GP processing) instead of blocking synchronize_rcu() within the
SRCU state machine.
- Having srcu_readers_active_idx_check() use poll_state_synchronize_rcu()
and defer retrying instead of blocking.
- Bounding the perf interrupt rate escalation to prevent the RCU stall in
the first place (though this would only mask the underlying SRCU↔RCU coupling
issue).
4. If you can reproduce reliably, adding the following debug options would
provide more information: CONFIG_RCU_TRACE=y, CONFIG_PROVE_RCU=y, and booting
with rcutree.rcu_kick_kthreads=1 to see if kicking the RCU threads helps break
the stall.
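Concretely, the debug setup above amounts to a config fragment plus a boot
parameter along these lines (a sketch using the kernel tree's scripts/config
helper; adapt to your build workflow):

```shell
# Enable RCU tracing and lockdep-based RCU checking in the build
./scripts/config --enable CONFIG_RCU_TRACE --enable CONFIG_PROVE_RCU
make olddefconfig

# Boot with RCU-kthread kicking to probe whether the stall is a starved
# kthread rather than a true deadlock; append to the kernel command line:
#   rcutree.rcu_kick_kthreads=1
```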