** Description changed:

  [SRU Justification]
  
  [Impact]
  
  Systems on Jammy running high-throughput DMA workloads experience soft lockups
  and RCU stalls in fq_flush_timeout, which result in system hangs.
  
  The IOVA allocator in the 5.15 kernel uses a per-CPU magazine cache (rcache)
  to avoid expensive rbtree operations. Each CPU has two magazines of 128 PFNs;
  when both are full, the primary "loaded" magazine is pushed to a global depot
  (a fixed-size array of 32 magazines per size-bin). When the depot is also
  full, the overflow magazine is freed via iova_magazine_free_pfns(), which
  acquires iova_rbtree_lock and performs up to 128 rbtree lookups and removals
  while holding it.
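
The overflow behaviour described above can be sketched as a small userspace
counting model. This is not the kernel code: the names toy_rcache, toy_insert
and toy_insert_many are hypothetical, locking and magazine allocation are left
out, and only fill levels are tracked rather than real PFNs. It shows how many
frees a single CPU's rcache can absorb before each further full magazine costs
128 slow-path operations.

```c
#include <assert.h>
#include <stddef.h>

#define MAG_SIZE        128  /* PFNs per magazine, as described for 5.15 */
#define MAX_DEPOT_MAGS   32  /* fixed depot capacity per size-bin */

/* Hypothetical model: fill levels only, no PFN values and no locks. */
struct toy_rcache {
    size_t loaded;      /* PFNs held in the "loaded" magazine */
    size_t prev;        /* PFNs held in the second magazine */
    size_t depot_mags;  /* full magazines parked in the depot */
    size_t rbtree_ops;  /* PFNs released through the slow path */
};

/* Recycle one PFN into the cache, following the old fixed-depot logic. */
static void toy_insert(struct toy_rcache *rc)
{
    assert(rc->loaded <= MAG_SIZE && rc->prev <= MAG_SIZE);

    if (rc->loaded == MAG_SIZE) {
        if (rc->prev < MAG_SIZE) {
            /* Second magazine has room: swap it in as "loaded". */
            size_t tmp = rc->loaded;
            rc->loaded = rc->prev;
            rc->prev = tmp;
        } else {
            /* Both full: park "loaded" in the depot, or free it inline. */
            if (rc->depot_mags < MAX_DEPOT_MAGS)
                rc->depot_mags++;
            else
                rc->rbtree_ops += MAG_SIZE; /* models iova_magazine_free_pfns */
            rc->loaded = 0; /* a fresh, empty magazine takes its place */
        }
    }
    rc->loaded++;
}

/* Recycle n PFNs and return the cumulative slow-path operation count. */
static size_t toy_insert_many(struct toy_rcache *rc, size_t n)
{
    for (size_t i = 0; i < n; i++)
        toy_insert(rc);
    return rc->rbtree_ops;
}
```

In this model an empty cache absorbs 2 * 128 + 32 * 128 = 4352 frees without
touching the rbtree. From the next free onwards, every 128 frees cost another
128 rbtree operations.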
  
  The problem manifests through the flush-queue timer. Every 10ms,
  fq_flush_timeout fires in softirq context and drains all CPUs' flush queues
  in a single non-preemptible loop. Because __iova_rcache_insert uses
  raw_cpu_ptr(), all recycled IOVAs are funnelled into the timer CPU's
  magazines. Once those magazines and the shared depot are full, every
  subsequent overflow triggers the expensive iova_magazine_free_pfns, resulting
  in up to 128 rbtree operations under iova_rbtree_lock, all within the same
  softirq:
  
-   fq_flush_timeout (timer softirq on CPU X)
-     iova_domain_flush
-     for_each_possible_cpu(cpu):
-       fq_ring_free (up to IOVA_FQ_SIZE=256 entries)
-         free_iova_fast
-           __iova_rcache_insert (into CPU X's rcache via raw_cpu_ptr)
-             if depot_size >= 32:
-               iova_magazine_free_pfns (128 rbtree ops under iova_rbtree_lock)
+   fq_flush_timeout (timer softirq on CPU X)
+     iova_domain_flush
+     for_each_possible_cpu(cpu):
+       fq_ring_free (up to IOVA_FQ_SIZE=256 entries)
+         free_iova_fast
+           __iova_rcache_insert (into CPU X's rcache via raw_cpu_ptr)
+             if depot_size >= 32:
+               iova_magazine_free_pfns (128 rbtree ops under iova_rbtree_lock)
  
  The RCU stall trace from an affected system on 5.15.0-117 confirms this exact
  path with reliable stack frames:
  
-   native_queued_spin_lock_slowpath+0x2c/0x40
-   _raw_spin_lock_irqsave+0x3d/0x50
-   iova_magazine_free_pfns.part.0+0x20/0xd0
-   free_iova_fast+0x219/0x290
-   fq_ring_free+0xa8/0x170
-   fq_flush_timeout+0x74/0xc0
-   call_timer_fn
-   run_timer_softirq
-   __do_softirq
+   native_queued_spin_lock_slowpath+0x2c/0x40
+   _raw_spin_lock_irqsave+0x3d/0x50
+   iova_magazine_free_pfns.part.0+0x20/0xd0
+   free_iova_fast+0x219/0x290
+   fq_ring_free+0xa8/0x170
+   fq_flush_timeout+0x74/0xc0
+   call_timer_fn
+   run_timer_softirq
+   __do_softirq
  
  [Fix]
  
  Backport upstream commits, adapted for the 5.15 codebase:
  1. 911aa1245da8 ("iommu/iova: Make the rcache depot scale better")
  2. 233045378dbb ("iommu/iova: Manage the depot list size")
  
  Cherry-pick upstream commit:
  3. 7591c127f3b1 ("kmemleak: iommu/iova: fix transient kmemleak false positive")
  
  Patch 1 replaces the fixed-size depot array with an unbounded singly-linked
  list. Magazines are always pushed to the depot regardless of its current
  size. As a result, the overflow path and its inline call to
  iova_magazine_free_pfns are eliminated from __iova_rcache_insert.
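
A minimal userspace sketch of such a list-based depot follows. The names
toy_mag, toy_depot, toy_depot_push and toy_depot_pop are hypothetical and are
not the kernel's identifiers, and locking is omitted. The point it illustrates
is that a push is a constant-time pointer update that can never hit a capacity
limit, so no inline free path is needed on insert.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical magazine: only the link and an id for illustration. */
struct toy_mag {
    struct toy_mag *next;
    int id;
};

/* Hypothetical depot: head of a singly-linked list plus a counter. */
struct toy_depot {
    struct toy_mag *head;
    size_t size;
};

/* Push a full magazine onto the depot; always succeeds, O(1). */
static void toy_depot_push(struct toy_depot *d, struct toy_mag *m)
{
    assert(m != NULL);
    m->next = d->head;
    d->head = m;
    d->size++;
}

/* Pop the most recently pushed magazine, or NULL if the depot is empty. */
static struct toy_mag *toy_depot_pop(struct toy_depot *d)
{
    struct toy_mag *m = d->head;

    if (m == NULL)
        return NULL;
    d->head = m->next;
    m->next = NULL;
    d->size--;
    return m;
}
```

Pushing 40 magazines in this sketch leaves the depot holding all 40, beyond the
old limit of 32, and a pop returns the last one pushed.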
  
  Patch 2 prevents unbounded memory growth of the now-unlimited depot by adding
  a delayed_work (background workqueue) that trims the depot when it exceeds
  num_online_cpus() magazines. This reclaim runs in process context, which is
  preemptible and sleepable, and therefore cannot cause soft lockups.
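
The trimming rule can be modelled in a few lines of userspace C. This sketch
assumes that each worker pass releases at most one magazine. That per-pass
amount is an assumption of the model and is not stated in this description.
The names toy_trim_pass and toy_passes_to_settle are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * One pass of a hypothetical trim worker: release one magazine only while
 * the depot holds more magazines than there are online CPUs.
 * ASSUMPTION: a single magazine is released per pass.
 */
static size_t toy_trim_pass(size_t depot_mags, size_t online_cpus)
{
    assert(online_cpus > 0);
    if (depot_mags > online_cpus)
        return depot_mags - 1;
    return depot_mags;
}

/* Count the passes needed until a pass no longer changes the depot size. */
static size_t toy_passes_to_settle(size_t depot_mags, size_t online_cpus)
{
    size_t passes = 0;

    while (toy_trim_pass(depot_mags, online_cpus) != depot_mags) {
        depot_mags = toy_trim_pass(depot_mags, online_cpus);
        passes++;
    }
    return passes;
}
```

Under that assumption, a depot of 40 magazines on an 8-CPU system settles after
32 passes, and a depot at or below the threshold is left untouched.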
  
  Patch 3 fixes a kmemleak false positive introduced by patch 1.
  
  Adaptations made for 5.15 backport:
  
  - Patches 1 and 2 modify both drivers/iommu/iova.c and include/linux/iova.h
-   because in 5.15, struct iova_rcache is defined in the header (upstream moved
-   it into iova.c in a prior refactoring series not present in 5.15).
+   because in 5.15, struct iova_rcache is defined in the header (upstream moved
+   it into iova.c in a prior refactoring series not present in 5.15).
  - The rcache init function in 5.15 is init_iova_rcaches() (static void, called
-   unconditionally from init_iova_domain) rather than upstream's
-   iova_domain_init_rcaches() (exported, returns int with error cleanup). The
-   backport preserves the 5.15 function signature and error handling pattern.
+   unconditionally from init_iova_domain) rather than upstream's
+   iova_domain_init_rcaches() (exported, returns int with error cleanup). The
+   backport preserves the 5.15 function signature and error handling pattern.
  - 5.15 uses top-of-function variable declarations rather than upstream's C99
-   in-loop declarations.
+   in-loop declarations.
  - The core logic (depot linked-list, overflow elimination, background worker) is
-   identical between upstream and the backport.
+   identical between upstream and the backport.
  
  [Test Plan]
  
  TODO
  
  [Where problems could occur]
  
  Regression risk is low as changes in patches 1 and 2 are confined to the IOVA
  rcache depot internals (drivers/iommu/iova.c and include/linux/iova.h). No
  changes have been made to IOVA allocation or free semantics from the caller's
  perspective. Patch 3 is purely diagnostic and has no runtime effect. Moreover,
  the fix is already available on Noble and Resolute, where it has been
  thoroughly tested.
  
+ One behavioral change worth noting is the depot memory usage profile. The old
+ code enforced a hard cap of 32 magazines per size-bin; when the depot was
+ full, overflow was freed immediately. The new code removes that cap and relies
+ on a delayed_work firing every 100ms to trim the depot. This means a burst of
+ DMA unmaps can temporarily accumulate more depot memory than the old code
+ would have allowed, since the background reclaim only runs on a 100ms clock.
+ This is not a bug in the patches, as upstream implements the same design.
+ Rather, it is a change in behavior compared to what 5.15 users have today. In
+ practice, the risk is low: each magazine is 1024 bytes, so even a large spike
+ of unmaps on a many-CPU system represents modest memory, and the reclaim
+ worker converges quickly.
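
To put rough numbers on the memory profile, the arithmetic can be written as a
one-line helper. It uses only the 1024-byte magazine size stated above. The
name toy_depot_bytes and the magazine counts used to exercise it are
hypothetical and are not measurements.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAG_BYTES 1024u  /* magazine size as stated in this description */

/* Bytes held by a depot containing depot_mags magazines (one size-bin). */
static size_t toy_depot_bytes(size_t depot_mags)
{
    /* Guard against overflow of the multiplication. */
    assert(depot_mags <= (size_t)-1 / TOY_MAG_BYTES);
    return depot_mags * TOY_MAG_BYTES;
}
```

The old cap of 32 magazines corresponds to 32768 bytes per size-bin. A burst
that parks a hypothetical 1000 magazines would hold 1024000 bytes until the
worker trims it.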
+ 
  [Other Info]
  
  Similar issues have been reported in [0], [1], and [2]. The fix has already
  been integrated into Noble and subsequent releases. Backporting this fix
  ensures stability for users of the 5.15 kernel.
  
  [0] - https://lkml.rescloud.iu.edu/2304.1/01286.html
  [1] - https://mailweb.openeuler.org/archives/list/[email protected]/message/FAOBDKYWJ5SNADM625H2A4YCOPRAIRGB/
  [2] - https://access.redhat.com/solutions/7031930

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2158106

Title:
  [Jammy] soft lockups and rcu stalls in fq_flush_timeout causing system
  hangs
