Public bug reported:

BugLink: https://bugs.launchpad.net/bugs/2077044

[Impact]

A deadlock can occur in zap_pid_ns_processes() which can hang the system
due to RCU getting stuck.

zap_pid_ns_processes() has a busy loop that calls kernel_wait4() on a
child process of the namespace init task, waiting for it to exit.

The problem is that it clears TIF_SIGPENDING but not TIF_NOTIFY_SIGNAL,
so the loop can spin forever: the child is sleeping in synchronize_rcu()
and is never woken up, because the parent is stuck in the busy loop and
never calls schedule() or rcu_note_context_switch().

An example oops:

Watchdog: BUG: soft lockup - CPU#3 stuck for 276s! [rcudeadlock:1836]
CPU: 3 PID: 1836 Comm: rcudeadlock Tainted: G L 5.15.0-117-generic #127-Ubuntu
RIP: 0010:_raw_read_lock+0xe/0x30
Code: f0 0f b1 17 74 08 31 c0 5d c3 cc cc cc cc b8 01 00 00 00 5d c3 cc cc cc cc 0f 1f 00 0f 1f 44 00 00 b8 00 02 00 00 f0 0f c1 07 <a9> ff 01 00 00 75 05 c3 cc cc cc cc 55 48 89 e5 e8 4d 79 36 ff 5d
CR2: 000000c0002b0000
Call Trace:
 <IRQ>
 ? show_trace_log_lvl+0x1d6/0x2ea
 ? show_trace_log_lvl+0x1d6/0x2ea
 ? kernel_wait4+0xaf/0x150
 ? show_regs.part.0+0x23/0x29
 ? show_regs.cold+0x8/0xd
 ? watchdog_timer_fn+0x1be/0x220
 ? lockup_detector_update_enable+0x60/0x60
 ? __hrtimer_run_queues+0x107/0x230
 ? read_hv_clock_tsc_cs+0x9/0x30
 ? hrtimer_interrupt+0x101/0x220
 ? hv_stimer0_isr+0x20/0x30
 ? __sysvec_hyperv_stimer0+0x32/0x70
 ? sysvec_hyperv_stimer0+0x7b/0x90
 </IRQ>
 <TASK>
 ? asm_sysvec_hyperv_stimer0+0x1b/0x20
 ? _raw_read_lock+0xe/0x30
 ? do_wait+0xa0/0x310
 kernel_wait4+0xaf/0x150
 ? thread_group_exited+0x50/0x50
 zap_pid_ns_processes+0x111/0x1a0
 forget_original_parent+0x348/0x360
 exit_notify+0x4a/0x210
 do_exit+0x24f/0x3c0
 do_group_exit+0x3b/0xb0
 get_signal+0x150/0x900
 arch_do_signal_or_restart+0xde/0x100
 ? __x64_sys_futex+0x78/0x1e0
 exit_to_user_mode_loop+0xc4/0x160
 exit_to_user_mode_prepare+0xa3/0xb0
 syscall_exit_to_user_mode+0x27/0x50
 ? x64_sys_call+0x1022/0x1fa0
 do_syscall_64+0x63/0xb0
 ? __io_uring_add_tctx_node+0x111/0x1a0
 ? fput+0x13/0x20
 ? __do_sys_io_uring_enter+0x10d/0x540
 ? __smp_call_single_queue+0x59/0x90
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x2c/0x50
 ? x64_sys_call+0x1819/0x1fa0
 ? do_syscall_64+0x63/0xb0
 ? try_to_wake_up+0x200/0x5a0
 ? wake_up_q+0x50/0x90
 ? futex_wake+0x159/0x190
 ? do_futex+0x162/0x1f0
 ? __x64_sys_futex+0x78/0x1e0
 ? switch_fpu_return+0x4e/0xc0
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x2c/0x50
 ? x64_sys_call+0x1022/0x1fa0
 ? do_syscall_64+0x63/0xb0
 ? do_user_addr_fault+0x1e7/0x670
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? irqentry_exit_to_user_mode+0xe/0x20
 ? irqentry_exit+0x1d/0x30
 ? exc_page_fault+0x89/0x170
 entry_SYSCALL_64_after_hwframe+0x6c/0xd6
 </TASK>

There is no known workaround.

[Fix]

This was fixed in the below commit in 6.10-rc5:

commit 7fea700e04bd3f424c2d836e98425782f97b494e
Author: Oleg Nesterov <o...@redhat.com>
Date: Sat Jun 8 14:06:16 2024 +0200
Subject: zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7fea700e04bd3f424c2d836e98425782f97b494e

This patch has made its way to upstream stable, and is already applied
to Ubuntu kernels.

[Testcase]

There are two possible testcases to reproduce this issue. Both are
courtesy of Rachel Menge, using the reproducers in her GitHub repo:

https://github.com/rlmenge/rcu-soft-lock-issue-repro

Start a Jammy or Noble VM on Azure; a D8sV3 will be plenty.

$ git clone https://github.com/rlmenge/rcu-soft-lock-issue-repro.git

npm repro:

Install Docker.

$ sudo docker run telescope.azurecr.io/issue-repro/zombie:v1.1.11
$ ./rcu-npm-repro.sh

go repro:

$ go mod init rcudeadlock.go
$ go mod tidy
$ CGO_ENABLED=0 go build -o ./rcudeadlock ./
$ sudo ./rcudeadlock

Look at dmesg. After some minutes, you should see the soft lockup shown
in the [Impact] section.

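As an aid to reading the [Impact] section, the following is a minimal
userspace sketch in C of the flag logic. It is not the kernel source. It
assumes, as the description implies, that signal_pending() is true while
either TIF_SIGPENDING or TIF_NOTIFY_SIGNAL is set, and that kernel_wait4()
returns without sleeping while it is true. Every model_* name, the
MODEL_* constants, the struct and the iteration cap are hypothetical and
exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel thread flags named in this bug. */
#define MODEL_TIF_SIGPENDING    (1u << 0)
#define MODEL_TIF_NOTIFY_SIGNAL (1u << 1)

/* Hypothetical, heavily reduced task: only the flag word matters here. */
struct model_task {
    unsigned int flags;
};

/* Assumption: "a signal is pending" means either flag is set. */
static bool model_signal_pending(const struct model_task *t)
{
    return (t->flags & (MODEL_TIF_SIGPENDING | MODEL_TIF_NOTIFY_SIGNAL)) != 0;
}

/*
 * Model of the busy loop: each iteration clears TIF_SIGPENDING (and,
 * with the fix, TIF_NOTIFY_SIGNAL) and then "waits". While a signal is
 * still pending the wait returns at once and the loop spins again.
 * Returns the number of iterations that spun without sleeping; max_iters
 * stands in for "forever".
 */
static unsigned int model_zap_loop(struct model_task *t, bool apply_fix,
                                   unsigned int max_iters)
{
    unsigned int spins = 0;

    for (unsigned int i = 0; i < max_iters; i++) {
        t->flags &= ~MODEL_TIF_SIGPENDING;
        if (apply_fix)
            t->flags &= ~MODEL_TIF_NOTIFY_SIGNAL;

        if (!model_signal_pending(t))
            break; /* the wait could sleep here, letting the child run */

        spins++;
    }
    return spins;
}

/* Usage example: the same task with and without the extra clear. */
static void model_demo(void)
{
    struct model_task t = {
        .flags = MODEL_TIF_SIGPENDING | MODEL_TIF_NOTIFY_SIGNAL,
    };

    /* Unfixed: TIF_NOTIFY_SIGNAL stays set, so every iteration spins. */
    assert(model_zap_loop(&t, false, 1000) == 1000);

    /* Fixed: both flags are cleared, so the first wait can sleep. */
    assert(model_zap_loop(&t, true, 1000) == 0);
}
```

In this model the unfixed loop only stops because of the artificial cap;
nothing inside the loop ever clears the second flag, which is the
situation the soft lockup above reflects.
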
[Where problems can occur]

The fix clears TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING, so that
signal_pending() returns false and zap_pid_ns_processes() does not end
up in a busy-wait loop. This change should work as intended.

If a regression were to occur, it could potentially affect all processes
in namespaces.

[Other Info]

Upstream mailing list discussion:
https://lore.kernel.org/linux-kernel/1386cd49-36d0-4a5c-85e9-bc42056a5...@linux.microsoft.com/T/

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Fix Released

** Affects: linux (Ubuntu Jammy)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: Fix Committed

** Affects: linux (Ubuntu Noble)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: Fix Committed

** Tags: sts

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

** Changed in: linux (Ubuntu Jammy)
       Status: New => Fix Committed

** Changed in: linux (Ubuntu Noble)
       Status: New => Fix Committed

** Changed in: linux (Ubuntu Jammy)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Noble)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Jammy)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Changed in: linux (Ubuntu Noble)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Description changed:

- BugLink: https://bugs.launchpad.net/bugs/
+ BugLink: https://bugs.launchpad.net/bugs/2077044

  (The remaining changed lines are the Call Trace, removed and re-added
  with identical text.)

** Tags added: sts

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2077044

Title:
  zap_pid_ns_processes() gets stuck in a busy loop when zombie processes
  are in namespace

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Noble:
  Fix Committed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2077044/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp