Public bug reported: BugLink: https://bugs.launchpad.net/bugs/2103564
[Impact] nfsd loops forever in nfsd4_cb_sequence_done() after it receives a NFS4ERR_DELAY and the connection is subsequently lost. What happens is that NFS4ERR_DELAY sets cb->cb_seq_status to -10008, but it is never set back to 1, so it just keeps sending NFS4ERR_DELAY. The stack trace looks like: watchdog: BUG: soft lockup - CPU#33 stuck for 22s! [kworker/u120:29:1520679] Kernel panic - not syncing: softlockup: hung tasks CPU: 33 PID: 1520679 Comm: kworker/u120:29 Tainted: G L 5.15.0-1069-gke #75-Ubuntu Workqueue: rpciod rpc_async_schedule [sunrpc] Call Trace: RIP: 0010:__rpc_sleep_on_priority_timeout+0x7b/0x110 [sunrpc] Code: 0f b6 f9 66 90 44 89 fa 48 89 de 4d 8d 7e 50 4c 89 f7 e8 c8 fb ff ff 4c 89 6b 28 49 8b 46 50 49 39 c7 74 5a 4d 3b 6e 60 78 54 <49> 8b 56 50 48 8d 43 60 48 89 42 08 48 89 53 60 4c 89 7b 68 49 89 ... rpc_sleep_on_timeout+0x56/0xa0 [sunrpc] rpc_delay+0x29/0x30 [sunrpc] nfsd4_cb_sequence_done+0x1b9/0x250 [nfsd] nfsd4_cb_done+0x1d/0xf0 [nfsd] pc_exit_task+0x5c/0x110 [sunrpc] ? __rpc_sleep_on_priority+0x80/0x80 [sunrpc] __rpc_execute+0x68/0x270 [sunrpc] rpc_async_schedule+0x30/0x50 [sunrpc] process_one_work+0x22b/0x3d0 worker_thread+0x53/0x420 ? process_one_work+0x3d0/0x3d0 kthread+0x12a/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x22/0x30 </TASK> There is no workaround. [Fix] This was fixed in 6.9-rc1 by: commit 961b4b5e86bf56a2e4b567f81682defa5cba957e From: Chuck Lever <chuck.le...@oracle.com> Date: Fri, 26 Jan 2024 12:45:17 -0500 Subject: NFSD: Reset cb_seq_status after NFS4ERR_DELAY Link: https://github.com/torvalds/linux/commit/961b4b5e86bf56a2e4b567f81682defa5cba957e This is present in 5.15.179 and 6.6.76 upstream stable. [Testcase] There is no known synthetic reproducer available. Currently we see it in production workloads on Google Kubernetes Engine, and we have successfully deployed and ran a test kernel in production with no further incidents occurring. Before it would lock up once a day. The test kernel is available in the following ppa: https://launchpad.net/~mruffell/+archive/ubuntu/sf407307-test If you install the kernel from the ppa, the issue no longer occurs. [Where problems can occur] We are resetting the value of cb->cb_seq_status back to 1 to let it get out of its state machine, and to actually make some progress, instead of being trapped at NFS4ERR_DELAY. If a regression were to occur, it would affect NFS v4.x systems, and it wouldn't likely cause any real issues, likely some flapping between NFS4ERR_DELAY and sending callbacks. ** Affects: linux (Ubuntu) Importance: Undecided Status: Fix Released ** Affects: linux (Ubuntu Jammy) Importance: Medium Assignee: Matthew Ruffell (mruffell) Status: In Progress ** Affects: linux (Ubuntu Noble) Importance: Medium Assignee: Matthew Ruffell (mruffell) Status: In Progress ** Tags: sts ** Description changed: - BugLink: https://bugs.launchpad.net/bugs/ + BugLink: https://bugs.launchpad.net/bugs/2103564 [Impact] nfsd loops forever in nfsd4_cb_sequence_done() after it receives a NFS4ERR_DELAY and the connection is subsequently lost. What happens is that NFS4ERR_DELAY sets cb->cb_seq_status to -10008, but it is never set back to 1, so it just keeps sending NFS4ERR_DELAY. The stack trace looks like: watchdog: BUG: soft lockup - CPU#33 stuck for 22s! [kworker/u120:29:1520679] - Kernel panic - not syncing: softlockup: hung tasks + Kernel panic - not syncing: softlockup: hung tasks CPU: 33 PID: 1520679 Comm: kworker/u120:29 Tainted: G L 5.15.0-1069-gke #75-Ubuntu Workqueue: rpciod rpc_async_schedule [sunrpc] Call Trace: - RIP: 0010:__rpc_sleep_on_priority_timeout+0x7b/0x110 [sunrpc] - Code: 0f b6 f9 66 90 44 89 fa 48 89 de 4d 8d 7e 50 4c 89 f7 e8 c8 fb ff ff 4c 89 6b 28 49 8b 46 50 49 39 c7 74 5a 4d 3b 6e 60 78 54 <49> 8b 56 50 48 8d 43 60 48 89 42 08 48 89 53 60 4c 89 7b 68 49 89 + RIP: 0010:__rpc_sleep_on_priority_timeout+0x7b/0x110 [sunrpc] + Code: 0f b6 f9 66 90 44 89 fa 48 89 de 4d 8d 7e 50 4c 89 f7 e8 c8 fb ff ff 4c 89 6b 28 49 8b 46 50 49 39 c7 74 5a 4d 3b 6e 60 78 54 <49> 8b 56 50 48 8d 43 60 48 89 42 08 48 89 53 60 4c 89 7b 68 49 89 ... - rpc_sleep_on_timeout+0x56/0xa0 [sunrpc] - rpc_delay+0x29/0x30 [sunrpc] - nfsd4_cb_sequence_done+0x1b9/0x250 [nfsd] - nfsd4_cb_done+0x1d/0xf0 [nfsd] - pc_exit_task+0x5c/0x110 [sunrpc] - ? __rpc_sleep_on_priority+0x80/0x80 [sunrpc] - __rpc_execute+0x68/0x270 [sunrpc] - rpc_async_schedule+0x30/0x50 [sunrpc] - process_one_work+0x22b/0x3d0 - worker_thread+0x53/0x420 - ? process_one_work+0x3d0/0x3d0 - kthread+0x12a/0x150 - ? set_kthread_struct+0x50/0x50 - ret_from_fork+0x22/0x30 - </TASK> + rpc_sleep_on_timeout+0x56/0xa0 [sunrpc] + rpc_delay+0x29/0x30 [sunrpc] + nfsd4_cb_sequence_done+0x1b9/0x250 [nfsd] + nfsd4_cb_done+0x1d/0xf0 [nfsd] + pc_exit_task+0x5c/0x110 [sunrpc] + ? __rpc_sleep_on_priority+0x80/0x80 [sunrpc] + __rpc_execute+0x68/0x270 [sunrpc] + rpc_async_schedule+0x30/0x50 [sunrpc] + process_one_work+0x22b/0x3d0 + worker_thread+0x53/0x420 + ? process_one_work+0x3d0/0x3d0 + kthread+0x12a/0x150 + ? set_kthread_struct+0x50/0x50 + ret_from_fork+0x22/0x30 + </TASK> There is no workaround. [Fix] This was fixed in 6.9-rc1 by: commit 961b4b5e86bf56a2e4b567f81682defa5cba957e From: Chuck Lever <chuck.le...@oracle.com> Date: Fri, 26 Jan 2024 12:45:17 -0500 Subject: NFSD: Reset cb_seq_status after NFS4ERR_DELAY Link: https://github.com/torvalds/linux/commit/961b4b5e86bf56a2e4b567f81682defa5cba957e This is present in 5.15.179 and 6.6.76 upstream stable. [Testcase] There is no known synthetic reproducer available. Currently we see it in production workloads on Google Kubernetes Engine, and we have successfully deployed and ran a test kernel in production with no further incidents occurring. Before it would lock up once a day. The test kernel is available in the following ppa: https://launchpad.net/~mruffell/+archive/ubuntu/sf407307-test If you install the kernel from the ppa, the issue no longer occurs. [Where problems can occur] We are resetting the value of cb->cb_seq_status back to 1 to let it get out of its state machine, and to actually make some progress, instead of being trapped at NFS4ERR_DELAY. If a regression were to occur, it would affect NFS v4.x systems, and it wouldn't likely cause any real issues, likely some flapping between NFS4ERR_DELAY and sending callbacks. ** Also affects: linux (Ubuntu Noble) Importance: Undecided Status: New ** Also affects: linux (Ubuntu Jammy) Importance: Undecided Status: New ** Changed in: linux (Ubuntu) Status: New => Fix Released ** Changed in: linux (Ubuntu Jammy) Status: New => In Progress ** Changed in: linux (Ubuntu Noble) Status: New => In Progress ** Changed in: linux (Ubuntu Jammy) Importance: Undecided => Medium ** Changed in: linux (Ubuntu Noble) Importance: Undecided => Medium ** Changed in: linux (Ubuntu Jammy) Assignee: (unassigned) => Matthew Ruffell (mruffell) ** Changed in: linux (Ubuntu Noble) Assignee: (unassigned) => Matthew Ruffell (mruffell) ** Tags added: sts -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/2103564 Title: nfsd hangs and never recovers after NFS4ERR_DELAY and a connection loss To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2103564/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs