Public bug reported:

BugLink: https://bugs.launchpad.net/bugs/2077044

[Impact]

A deadlock can occur in zap_pid_ns_processes() which can hang the system
due to RCU getting stuck.

zap_pid_ns_processes() has a busy loop that calls kernel_wait4() on a
child process of the namespace init task, waiting for it to exit. The
problem is that it clears TIF_SIGPENDING but not TIF_NOTIFY_SIGNAL, so
we get stuck in the busy loop forever: the child is sleeping in
synchronize_rcu() and is never woken up, because the parent is stuck
in the busy loop and never calls schedule() or
rcu_note_context_switch().
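
For illustration, the busy loop looks roughly like the following (a
simplified sketch of the pre-fix loop in kernel/pid_namespace.c, not
the exact Ubuntu source). With TIF_NOTIFY_SIGNAL still set,
signal_pending() stays true, so kernel_wait4() returns without ever
sleeping and the loop spins without calling schedule():

    /* zap_pid_ns_processes(), pre-fix, simplified sketch */
    do {
            /* only TIF_SIGPENDING is cleared; TIF_NOTIFY_SIGNAL stays set */
            clear_thread_flag(TIF_SIGPENDING);
            rc = kernel_wait4(-1, NULL, __WALL, NULL);
            /* signal_pending() is still true, so the wait cannot block;
             * it returns immediately and we go around the loop again */
    } while (rc != -ECHILD);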

The resulting oops is:

Watchdog: BUG: soft lockup - CPU#3 stuck for 276s! [rcudeadlock:1836]
CPU: 3 PID: 1836 Comm: rcudeadlock Tainted: G             L     5.15.0-117-generic #127-Ubuntu
RIP: 0010:_raw_read_lock+0xe/0x30
Code: f0 0f b1 17 74 08 31 c0 5d c3 cc cc cc cc b8 01 00 00 00 5d c3 cc cc cc cc 0f 1f 00 0f 1f 44 00 00 b8 00 02 00 00 f0 0f c1 07 <a9> ff 01 00 00 75 05 c3 cc cc cc cc 55 48 89 e5 e8 4d 79 36 ff 5d
CR2: 000000c0002b0000
Call Trace:
 <IRQ>
 ? show_trace_log_lvl+0x1d6/0x2ea
 ? show_trace_log_lvl+0x1d6/0x2ea
 ? kernel_wait4+0xaf/0x150
 ? show_regs.part.0+0x23/0x29
 ? show_regs.cold+0x8/0xd
 ? watchdog_timer_fn+0x1be/0x220
 ? lockup_detector_update_enable+0x60/0x60
 ? __hrtimer_run_queues+0x107/0x230
 ? read_hv_clock_tsc_cs+0x9/0x30
 ? hrtimer_interrupt+0x101/0x220
 ? hv_stimer0_isr+0x20/0x30
 ? __sysvec_hyperv_stimer0+0x32/0x70
 ? sysvec_hyperv_stimer0+0x7b/0x90
 </IRQ>
 <TASK>
 ? asm_sysvec_hyperv_stimer0+0x1b/0x20
 ? _raw_read_lock+0xe/0x30
 ? do_wait+0xa0/0x310
 kernel_wait4+0xaf/0x150
 ? thread_group_exited+0x50/0x50
 zap_pid_ns_processes+0x111/0x1a0
 forget_original_parent+0x348/0x360
 exit_notify+0x4a/0x210
 do_exit+0x24f/0x3c0
 do_group_exit+0x3b/0xb0
 get_signal+0x150/0x900
 arch_do_signal_or_restart+0xde/0x100
 ? __x64_sys_futex+0x78/0x1e0
 exit_to_user_mode_loop+0xc4/0x160
 exit_to_user_mode_prepare+0xa3/0xb0
 syscall_exit_to_user_mode+0x27/0x50
 ? x64_sys_call+0x1022/0x1fa0
 do_syscall_64+0x63/0xb0
 ? __io_uring_add_tctx_node+0x111/0x1a0
 ? fput+0x13/0x20
 ? __do_sys_io_uring_enter+0x10d/0x540
 ? __smp_call_single_queue+0x59/0x90
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x2c/0x50
 ? x64_sys_call+0x1819/0x1fa0
 ? do_syscall_64+0x63/0xb0
 ? try_to_wake_up+0x200/0x5a0
 ? wake_up_q+0x50/0x90
 ? futex_wake+0x159/0x190
 ? do_futex+0x162/0x1f0
 ? __x64_sys_futex+0x78/0x1e0
 ? switch_fpu_return+0x4e/0xc0
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x2c/0x50
 ? x64_sys_call+0x1022/0x1fa0
 ? do_syscall_64+0x63/0xb0
 ? do_user_addr_fault+0x1e7/0x670
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? irqentry_exit_to_user_mode+0xe/0x20
 ? irqentry_exit+0x1d/0x30
 ? exc_page_fault+0x89/0x170
 entry_SYSCALL_64_after_hwframe+0x6c/0xd6
 </TASK>

There is no known workaround.

[Fix]

This was fixed by the following commit, which landed in 6.10-rc5:

commit 7fea700e04bd3f424c2d836e98425782f97b494e
Author: Oleg Nesterov <o...@redhat.com>
Date:   Sat Jun 8 14:06:16 2024 +0200
Subject: zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7fea700e04bd3f424c2d836e98425782f97b494e

This patch has made its way to upstream stable, and is already applied to Ubuntu
kernels.
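
The change itself is a one-line addition to the same loop, sketched
below (paraphrased, not the verbatim patch):

    /* zap_pid_ns_processes(), with the fix applied, simplified sketch */
    do {
            clear_thread_flag(TIF_SIGPENDING);
            clear_thread_flag(TIF_NOTIFY_SIGNAL);  /* added by the fix */
            rc = kernel_wait4(-1, NULL, __WALL, NULL);
    } while (rc != -ECHILD);

With both flags cleared, signal_pending() returns false, kernel_wait4()
can block as intended, and the child stuck in synchronize_rcu() can be
woken up and exit, letting the loop terminate with -ECHILD.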

[Testcase]

There are two possible testcases to reproduce this issue, both
courtesy of Rachel Menge, using the reproducers in her github repo:

https://github.com/rlmenge/rcu-soft-lock-issue-repro

Start a Jammy or Noble VM on Azure; a D8sV3 will be plenty.

$ git clone https://github.com/rlmenge/rcu-soft-lock-issue-repro.git

npm repro:

Install Docker.

$ sudo docker run telescope.azurecr.io/issue-repro/zombie:v1.1.11
$ ./rcu-npm-repro.sh

go repro:

$ go mod init rcudeadlock.go
$ go mod tidy
$ CGO_ENABLED=0 go build -o ./rcudeadlock ./
$ sudo ./rcudeadlock

Look at dmesg. After some minutes, you should see the soft lockup
splat from the Impact section.

[Where problems can occur]

We are clearing TIF_NOTIFY_SIGNAL, alongside TIF_SIGPENDING, in the
task looping in zap_pid_ns_processes(), so that signal_pending()
returns false and kernel_wait4() can sleep instead of busy waiting.
This change is a one-line addition and should work as intended.

If a regression were to occur, it could potentially affect all
processes in PID namespaces, particularly when a namespace is torn
down.

[Other Info]

Upstream mailing list discussion:
https://lore.kernel.org/linux-kernel/1386cd49-36d0-4a5c-85e9-bc42056a5...@linux.microsoft.com/T/

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Fix Released

** Affects: linux (Ubuntu Jammy)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: Fix Committed

** Affects: linux (Ubuntu Noble)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: Fix Committed


** Tags: sts

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

** Changed in: linux (Ubuntu Jammy)
       Status: New => Fix Committed

** Changed in: linux (Ubuntu Noble)
       Status: New => Fix Committed

** Changed in: linux (Ubuntu Jammy)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Noble)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Jammy)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Changed in: linux (Ubuntu Noble)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Tags added: sts

Title:
  zap_pid_ns_processes() gets stuck in a busy loop when zombie processes
  are in namespace

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Noble:
  Fix Committed
