** Description changed:
[SRU Justification]
[Impact]
Systems on Jammy running high-throughput DMA workloads experience soft lockups
and RCU stalls in fq_flush_timeout, which result in system hangs.
The IOVA allocator in the 5.15 kernel uses a per-CPU magazine cache (rcache) to
avoid expensive rbtree operations. Each CPU has two magazines of 128 PFNs; when
both are full, the primary "loaded" magazine is pushed to a global depot (a
fixed-size array of 32 magazines per size-bin). When the depot is also full, the
overflow magazine is freed via iova_magazine_free_pfns(), which acquires
iova_rbtree_lock and performs up to 128 rbtree lookups and removals while
holding it.
The problem manifests through the flush-queue timer. Every 10ms,
fq_flush_timeout fires in softirq context and drains all CPUs' flush queues in a
single non-preemptible loop. Because __iova_rcache_insert uses raw_cpu_ptr(),
all recycled IOVAs are funnelled into the timer CPU's magazines. Once those
magazines and the shared depot are full, every subsequent overflow triggers
the expensive iova_magazine_free_pfns, resulting in up to 128 rbtree operations
under iova_rbtree_lock, all within the same softirq:
fq_flush_timeout (timer softirq on CPU X)
  iova_domain_flush
    for_each_possible_cpu(cpu):
      fq_ring_free (up to IOVA_FQ_SIZE=256 entries)
        free_iova_fast
          __iova_rcache_insert (into CPU X's rcache via raw_cpu_ptr)
            if depot_size >= 32:
              iova_magazine_free_pfns (128 rbtree ops under iova_rbtree_lock)
The RCU stall trace from an affected system on 5.15.0-117 confirms this exact
path with reliable stack frames:
native_queued_spin_lock_slowpath+0x2c/0x40
_raw_spin_lock_irqsave+0x3d/0x50
iova_magazine_free_pfns.part.0+0x20/0xd0
free_iova_fast+0x219/0x290
fq_ring_free+0xa8/0x170
fq_flush_timeout+0x74/0xc0
call_timer_fn
run_timer_softirq
__do_softirq
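To get a feel for how often the slow path is hit, the overflow behaviour
described above can be modelled in user space. This is only a rough sketch, not
kernel code: the toy_* names are hypothetical, locking and the rbtree are left
out, and the model assumes every flush queue is full and everything lands in a
single size-bin on the timer CPU.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAG_SIZE  128  /* PFNs per magazine, as described above */
#define TOY_DEPOT_MAX 32   /* fixed depot capacity per size-bin */

/* Toy view of one size-bin as seen from the timer CPU (hypothetical). */
struct toy_rcache {
    size_t loaded;      /* PFNs in the "loaded" magazine */
    size_t prev;        /* PFNs in the second magazine */
    size_t depot;       /* full magazines parked in the depot */
    size_t slow_frees;  /* PFNs that went through the rbtree path */
};

/* Model of caching one freed PFN with the old fixed-size depot. */
static void toy_insert(struct toy_rcache *rc)
{
    if (rc->loaded == TOY_MAG_SIZE) {
        if (rc->prev < TOY_MAG_SIZE) {
            /* Second magazine has room: swap the two. */
            size_t tmp = rc->loaded;
            rc->loaded = rc->prev;
            rc->prev = tmp;
        } else if (rc->depot < TOY_DEPOT_MAX) {
            /* Park the full magazine and start a fresh one. */
            rc->depot++;
            rc->loaded = 0;
        } else {
            /* Depot full: stands in for iova_magazine_free_pfns(). */
            rc->slow_frees += TOY_MAG_SIZE;
            rc->loaded = 0;
        }
    }
    assert(rc->loaded < TOY_MAG_SIZE);
    rc->loaded++;
}

/* Model of one timer run draining `per_cpu` entries from each CPU's queue. */
static size_t toy_drain(struct toy_rcache *rc, size_t cpus, size_t per_cpu)
{
    for (size_t cpu = 0; cpu < cpus; cpu++)
        for (size_t i = 0; i < per_cpu; i++)
            toy_insert(rc);
    return rc->slow_frees;
}
```

Starting from an empty cache, toy_drain(&rc, 64, 256) sends 12032 of the 16384
PFNs through the slow path in this model, all inside one timer run.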
[Fix]
Backport upstream commits, adapted for the 5.15 codebase:
1. 911aa1245da8 ("iommu/iova: Make the rcache depot scale better")
2. 233045378dbb ("iommu/iova: Manage the depot list size")
Cherry-pick upstream commit:
3. 7591c127f3b1 ("kmemleak: iommu/iova: fix transient kmemleak false positive")
Patch 1 replaces the fixed-size depot array with an unbounded singly-linked
list. Full magazines are always pushed to the depot, regardless of how many it
already holds. As a result, the overflow path and its inline call to
iova_magazine_free_pfns are eliminated from __iova_rcache_insert.
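A minimal user-space sketch of such a list-based depot follows. The toy_* names
are hypothetical and locking is omitted. Two things here are assumptions rather
than statements about the backport: the link pointer shares storage with the
size field, and the magazine holds 127 PFNs so that the struct is 1024 bytes on
a 64-bit build.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAG_SIZE 127  /* assumption: 127 PFNs + 1 word = 1024 bytes on LP64 */

struct toy_magazine {
    union {
        unsigned long size;         /* used while a CPU fills the magazine */
        struct toy_magazine *next;  /* used while parked in the depot */
    };
    unsigned long pfns[TOY_MAG_SIZE];
};

struct toy_depot {
    struct toy_magazine *head;  /* most recently parked magazine */
    size_t count;               /* number of parked magazines */
};

/* Park a full magazine; never fails, so no overflow path is needed. */
static void toy_depot_push(struct toy_depot *d, struct toy_magazine *mag)
{
    assert(mag->size == TOY_MAG_SIZE);  /* only full magazines are parked */
    mag->next = d->head;                /* overwrites size, implied by "full" */
    d->head = mag;
    d->count++;
}

/* Take a full magazine back out, or NULL if the depot is empty. */
static struct toy_magazine *toy_depot_pop(struct toy_depot *d)
{
    struct toy_magazine *mag = d->head;

    if (!mag)
        return NULL;
    d->head = mag->next;
    mag->size = TOY_MAG_SIZE;  /* restore the size the link replaced */
    d->count--;
    return mag;
}
```

Push and pop are both constant-time pointer updates, which is why the insert
path no longer needs to fall back to freeing PFNs inline.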
Patch 2 prevents unbounded memory growth of the now-unlimited depot by adding a
delayed_work (background workqueue) that trims the depot when it exceeds
num_online_cpus() magazines. This reclaim runs in process context, which is
preemptible and sleepable, and therefore cannot cause soft lockups.
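The trimming rule can be sketched in user space as below. The toy_* names are
hypothetical. The sketch assumes each worker run frees at most one magazine and
re-arms itself after the 100ms delay mentioned later in this description; that
pacing is an assumption of the model, not something stated here about the
backport.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DEPOT_DELAY_MS 100  /* delay between worker runs */

/*
 * One worker run: frees one magazine if the depot is above the threshold.
 * Returns true if it freed something and would re-arm itself.
 */
static bool toy_trim_step(size_t *depot_size, size_t online_cpus)
{
    if (*depot_size <= online_cpus)
        return false;
    (*depot_size)--;
    return true;
}

/* Approximate time for the depot to fall back to the threshold. */
static unsigned long toy_trim_time_ms(size_t depot_size, size_t online_cpus)
{
    unsigned long runs = 0;

    while (toy_trim_step(&depot_size, online_cpus))
        runs++;
    assert(depot_size <= online_cpus);
    return runs * TOY_DEPOT_DELAY_MS;
}
```

Under these assumptions, a depot holding 100 magazines on a 64-CPU system is
trimmed back to 64 in 36 runs, while a depot at or below the threshold is left
alone.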
Patch 3 fixes a kmemleak false positive introduced by patch 1.
Adaptations made for 5.15 backport:
- Patches 1 and 2 modify both drivers/iommu/iova.c and include/linux/iova.h
because in 5.15, struct iova_rcache is defined in the header (upstream moved
it into iova.c in a prior refactoring series not present in 5.15).
- The rcache init function in 5.15 is init_iova_rcaches() (static void, called
unconditionally from init_iova_domain) rather than upstream's
iova_domain_init_rcaches() (exported, returns int with error cleanup). The
backport preserves the 5.15 function signature and error handling pattern.
- 5.15 uses top-of-function variable declarations rather than upstream's C99
in-loop declarations.
- The core logic (depot linked-list, overflow elimination, background worker) is
identical between upstream and the backport.
[Test Plan]
TODO
+ Test kernel at:
+ https://launchpad.net/~munirsid/+archive/ubuntu/sf4384770-bp
+
[Where problems could occur]
Regression risk is low as changes in patches 1 and 2 are confined to the IOVA
rcache depot internals (drivers/iommu/iova.c and include/linux/iova.h). No
changes have been made to IOVA allocation or free semantics from the caller's
perspective. Patch 3 is purely diagnostic and has no runtime effect. Moreover,
the fix is already available on Noble and Resolute, where it has been thoroughly
tested.
One behavioral change worth noting is the depot memory usage profile. The old
code enforced a hard cap of 32 magazines per size-bin; when the depot was full,
overflow was freed immediately. The new code removes that cap and relies on a
delayed_work firing every 100ms to trim the depot. This means a burst of DMA
unmaps can temporarily accumulate more depot memory than the old code would have
allowed, since the background reclaim only runs on a 100ms clock. This is not a
bug in the patches, as upstream implements the same design. Rather, it is a
change in behavior compared to what 5.15 users have today. In practice, the risk
- is low: each magazine is 1024 bytes, so even a large spike of unmaps on a
+ is low: each magazine is 1024 bytes, so even a large spike of unmaps on a
many-CPU system represents modest memory, and the reclaim worker converges
quickly.
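As a rough worked example of the memory involved, the arithmetic can be written
out as below. The helper names are hypothetical, and the comparison against the
old cap approximates old magazines at the same 1024 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAG_BYTES     1024  /* per-magazine size quoted above */
#define TOY_OLD_DEPOT_CAP 32    /* old hard cap per size-bin */

/* Memory held by `magazines` parked magazines in one size-bin. */
static size_t toy_depot_bytes(size_t magazines)
{
    assert(magazines <= SIZE_MAX / TOY_MAG_BYTES);  /* guard against overflow */
    return magazines * TOY_MAG_BYTES;
}

/* Memory held beyond what the old fixed-size depot could have kept. */
static size_t toy_extra_bytes(size_t magazines)
{
    if (magazines <= TOY_OLD_DEPOT_CAP)
        return 0;
    return toy_depot_bytes(magazines - TOY_OLD_DEPOT_CAP);
}
```

For instance, 4096 parked magazines in one size-bin come to 4 MiB, and a depot
of 256 magazines holds 224 KiB more than the old cap would have allowed.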
[Other Info]
Similar issues have been reported in [0], [1], and [2]. The fix has already been
integrated into Noble and subsequent releases. Backporting this fix ensures
stability for users of the 5.15 kernel.
[0] - https://lkml.rescloud.iu.edu/2304.1/01286.html
[1] -
https://mailweb.openeuler.org/archives/list/[email protected]/message/FAOBDKYWJ5SNADM625H2A4YCOPRAIRGB/
[2] - https://access.redhat.com/solutions/7031930
** Description changed:
- Test kernel at:
+ Test kernel in:
https://launchpad.net/~munirsid/+archive/ubuntu/sf4384770-bp
--
https://bugs.launchpad.net/bugs/2158106
Title:
[Jammy] soft lockups and rcu stalls in fq_flush_timeout causing system
hangs