Hello. I have been trying to track down this bug, without success, for four months, since I upgraded my two file servers to Debian Bookworm in August 2024. Finding a solution is critical for me because I manage a high-school network where home directories are shared over NFS, and at the moment I get a crash every week. But my situation may help to find what is wrong, because the crash occurs relatively often.
Here is my current investigation. The last three stable Debian Linux kernels all seem affected by this bug on the server side: 6.1.112-1, 6.1.115-1, 6.1.119-1. I have not tested any earlier Bookworm version so far. It is difficult for me to give the exact client kernel version, as I have around 450 Debian Bookworm desktops, all configured with automatic upgrades, but they may not have been rebooted or powered on for a long time. And I do not know yet how to determine which client is causing the crash (after the log excerpts below I sketch what I plan to try in order to identify it).

My two servers have completely different hardware, and one is bare metal while the other is virtualized, so it does not look like a hardware-related problem. The crash always occurs when there is some load on the servers, but the load does not need to be very high; sometimes the problem occurs with very few students working (around 75 clients). Very strangely, in my case, the problem occurs exactly once per week, on one server. So I first thought about a log-rotation problem, but I did not find any clue in that direction. Load balancing does not resolve the issue: at first the problem always occurred on "server1", and after gradually migrating users to server2 it now occurs on "server2". Very strangely, the problem never occurs twice in a short period of time. This makes me think of a memory leak or a cache/swapping problem. I will try to reboot the servers every day to see whether that changes anything.

I get approximately 40 "receive_cb_reply: Got unrecognized reply: calldir" messages per week on each server. These messages do not always lead to the crash, but there are always one or two of them just before the crash. Like this:

(crash 1)
2024-11-07T17:43:33.879937+01:00 fichdc01 kernel: [372607.103736] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000639ae95e xid c9c4c5ef
2024-11-07T17:43:33.879942+01:00 fichdc01 kernel: [372607.103760] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000639ae95e xid c8c4c5ef
2024-11-07T17:46:07.480005+01:00 fichdc01 kernel: [372760.700382] INFO: task nfsd:1376 blocked for more than 120 seconds.

(crash 2)
2024-11-15T10:12:25.053735+01:00 fichdc01 kernel: [450557.120399] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 000000005ab3e3a5 xid bf5c5798
2024-11-15T10:12:25.053755+01:00 fichdc01 kernel: [450557.120616] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 000000005ab3e3a5 xid c05c5798
2024-11-15T10:14:43.805798+01:00 fichdc01 kernel: [450695.869270] INFO: task nfsd:1357 blocked for more than 120 seconds.

(crash 3)
2024-11-22T09:17:47.855807+01:00 fichdc01 kernel: [224734.495096] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 000000008cac1606 xid b17462f6
2024-11-22T09:19:58.535823+01:00 fichdc01 kernel: [224865.170751] INFO: task nfsd:1438 blocked for more than 120 seconds.

(crash 4)
2024-11-29T16:06:00.541594+01:00 fichdc02 kernel: [240859.889516] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 000000004d6a097d xid f86a9543
2024-11-29T16:06:00.541622+01:00 fichdc02 kernel: [240859.890673] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 000000004d6a097d xid f96a9543
2024-11-29T16:09:13.053724+01:00 fichdc02 kernel: [241052.394494] INFO: task nfsd:1733 blocked for more than 120 seconds.

After the first "nfsd blocked" message, 8 to 10 more "nfsd blocked" messages follow, one batch every 120 seconds:

2024-11-29T16:09:13.053724+01:00 fichdc02 kernel: [241052.394494] INFO: task nfsd:1733 blocked for more than 120 seconds.
2024-11-29T16:09:13.054029+01:00 fichdc02 kernel: [241052.398245] INFO: task nfsd:1734 blocked for more than 120 seconds.
2024-11-29T16:09:13.057722+01:00 fichdc02 kernel: [241052.400733] INFO: task nfsd:1735 blocked for more than 120 seconds.
2024-11-29T16:11:13.885945+01:00 fichdc02 kernel: [241173.222137] INFO: task nfsd:1732 blocked for more than 120 seconds.
2024-11-29T16:11:13.886106+01:00 fichdc02 kernel: [241173.224428] INFO: task nfsd:1733 blocked for more than 241 seconds.
2024-11-29T16:11:13.890152+01:00 fichdc02 kernel: [241173.226583] INFO: task nfsd:1734 blocked for more than 241 seconds.
2024-11-29T16:11:13.890241+01:00 fichdc02 kernel: [241173.228945] INFO: task nfsd:1735 blocked for more than 241 seconds.
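As announced above, here is roughly what I plan to try in order to identify the client behind these "receive_cb_reply" messages: a rolling packet capture on the NFS port on the server, so that when the message appears I can look up the logged xid in the capture and read the client address off the matching packet. This is only a sketch, not something I have validated yet; the interface name, the capture path and the ring size are examples to adapt, and the byte order of the xid printed in the log may need swapping when searching.

# Keep a ring buffer of 48 files of ~500 MB each (oldest file is
# overwritten first), capturing full packets on the NFS port:
tcpdump -i eth0 -s 0 'tcp port 2049' -C 500 -W 48 -w /var/tmp/nfs-ring.pcap

# After the next "receive_cb_reply ... xid c9c4c5ef" message, search the
# ring file that covers that time window (here nfs-ring.pcap00 as an
# example) for the same RPC xid; the peer address of the matching packet
# should be the client that sent the unexpected reply:
tshark -r /var/tmp/nfs-ring.pcap00 -Y 'rpc.xid == 0xc9c4c5ef' \
       -T fields -e ip.src -e ip.dst -e rpc.xid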
Strangely, sometimes clients that already have an open NFS session can still access the server for about 30 minutes, but no new connections are allowed; after 30 minutes everything is blocked. Sometimes, instead, server access stops immediately after the "nfsd blocked" error messages. I cannot restart the server normally or kill nfsd, so I have to use sysrq to reboot the servers immediately...

The first "nfsd blocked" message is always related to "nfsd4_destroy_session". The following ones are related to "nfsd4_destroy_session", "nfsd4_create_session", or "nfsd4_shutdown_callback". Since 6.1.115 I also get some kworker error messages like this:

INFO: task kworker/u96:2:39983 blocked for more than 120 seconds.
      Not tainted 6.1.0-28-amd64 #1 Debian 6.1.119-1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u96:2   state:D stack:0     pid:39983 ppid:2      flags:0x00004000
Workqueue: nfsd4 laundromat_main [nfsd]
Call Trace:
 <TASK>
 __schedule+0x34d/0x9e0
 schedule+0x5a/0xd0
 schedule_timeout+0x118/0x150
 wait_for_completion+0x86/0x160
 __flush_workqueue+0x152/0x420
 nfsd4_shutdown_callback+0x49/0x130 [nfsd]
 ? _raw_spin_unlock+0xa/0x30
 ? nfsd4_return_all_client_layouts+0xc4/0xf0 [nfsd]
 ? nfsd4_shutdown_copy+0x28/0x130 [nfsd]
 __destroy_client+0x1f3/0x290 [nfsd]
 nfs4_process_client_reaplist+0xa2/0x110 [nfsd]
 laundromat_main+0x1ce/0x880 [nfsd]
 process_one_work+0x1c7/0x380
 worker_thread+0x4d/0x380
 ? rescuer_thread+0x3a0/0x3a0
 kthread+0xda/0x100
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x22/0x30
 </TASK>
task:kworker/u96:3   state:D stack:0     pid:40084 ppid:2      flags:0x00004000
Workqueue: nfsd4 nfsd4_state_shrinker_worker [nfsd]
Call Trace:
 <TASK>
 __schedule+0x34d/0x9e0
 schedule+0x5a/0xd0
 schedule_timeout+0x118/0x150
 wait_for_completion+0x86/0x160
 __flush_workqueue+0x152/0x420
 nfsd4_shutdown_callback+0x49/0x130 [nfsd]
 ? _raw_spin_unlock+0xa/0x30
 ? nfsd4_return_all_client_layouts+0xc4/0xf0 [nfsd]
 ? nfsd4_shutdown_copy+0x28/0x130 [nfsd]
 __destroy_client+0x1f3/0x290 [nfsd]
 nfs4_process_client_reaplist+0xa2/0x110 [nfsd]
 nfsd4_state_shrinker_worker+0xf7/0x320 [nfsd]
 process_one_work+0x1c7/0x380
 worker_thread+0x4d/0x380
 ? rescuer_thread+0x3a0/0x3a0
 kthread+0xda/0x100
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x22/0x30
 </TASK>

I can do more investigation if someone gives me something to try (see also the P.S. below for what I plan to capture at the next hang); for now I have no more ideas. And I still do not know how to determine which client is causing the "unrecognized reply" messages.

Here are some other links about this bug:
https://lore.kernel.org/all/987ec8b2-40da-4745-95c2-8ffef061c...@aixigo.com/T/
https://forum.openmediavault.org/index.php?thread/52851-nfs-crash/
https://forum.proxmox.com/threads/kernel-6-8-x-nfs-server-bug.154272/
https://forums.truenas.com/t/truenas-nfs-random-crash/9200/34

Regards,
Baptiste.
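P.S. Since I already have to use sysrq to force the reboot, here is what I plan to do the next time the hang starts, before rebooting, in case the extra data is useful. This is only a sketch of my plan; it assumes rpcdebug from nfs-utils is installed and that the nfsd filesystem is mounted on /proc/fs/nfsd, and the output path is just an example.

# Dump the backtraces of all blocked (D-state) tasks into the kernel log,
# so the complete set of stuck nfsd threads and kworkers is recorded:
echo w > /proc/sysrq-trigger

# Save the list of NFSv4 clients currently known to the server (each
# "info" file contains the client's address among other details):
cat /proc/fs/nfsd/clients/*/info > /root/nfsd-clients-$(date +%F-%H%M).txt

# Optionally raise sunrpc/nfsd debug logging beforehand (very verbose):
rpcdebug -m rpc -s all
rpcdebug -m nfsd -s all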