Hi all,

On Mon, Dec 02, 2024 at 10:25:09AM +0100, Pellegrin Baptiste wrote:
> Hello.
>
> I have been trying to address this bug, without success, for four
> months, ever since I upgraded my two file servers to Debian Bookworm in
> August 2024. Finding a solution is critical for me, as I manage a
> high-school network where home directories are shared over NFS, and I
> currently get a crash every week. But my situation may help to find
> what is wrong, because the crash occurs relatively often.
>
> Here is my current investigation.
>
> The last three stable Debian Linux kernels all seem to be affected by
> this bug on the server side: 6.1.112-1, 6.1.115-1, 6.1.119-1. I have
> not yet tested any earlier Bookworm kernel. It is difficult for me to
> give the exact client kernel versions, as I have around 450 Debian
> Bookworm desktops, all configured with automatic upgrades, and they may
> not have been rebooted/powered on for a long time. I also don't
> currently know how to determine which client is causing the crash.
>
> My two servers have completely different hardware; one is bare metal
> and the other is virtualized. So it does not seem to be a
> hardware-related problem.
>
> The crash always occurs when there is some load on the servers, but the
> load does not need to be very high. Sometimes the problem occurs with
> very few students working (around 75 clients).
>
> Very strangely, in my case the problem occurs exactly once per week, on
> one server. So I first suspected a log-rotation problem, but I did not
> find any clues in that direction.
>
> Load balancing does not resolve the issue. At first the problem always
> occurred on my "server1"; after gradually migrating users to server2,
> it now occurs on "server2".
>
> Very strangely, the problem never occurs twice within a short period of
> time. This makes me think of some memory leak or cache/swapping
> problem. I will try rebooting the servers every day to see if that
> changes anything.
>
> I get approximately 40 "receive_cb_reply: Got unrecognized reply:
> calldir" messages per week on each server. These messages do not always
> lead to the crash, but there are always one or two "receive_cb_reply:
> Got unrecognized reply: calldir" messages before the crash. Like this:
>
> (crash 1)
> 2024-11-07T17:43:33.879937+01:00 fichdc01 kernel: [372607.103736] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000639ae95e xid c9c4c5ef
> 2024-11-07T17:43:33.879942+01:00 fichdc01 kernel: [372607.103760] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000639ae95e xid c8c4c5ef
> 2024-11-07T17:46:07.480005+01:00 fichdc01 kernel: [372760.700382] INFO: task nfsd:1376 blocked for more than 120 seconds.
>
> (crash 2)
> 2024-11-15T10:12:25.053735+01:00 fichdc01 kernel: [450557.120399] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 000000005ab3e3a5 xid bf5c5798
> 2024-11-15T10:12:25.053755+01:00 fichdc01 kernel: [450557.120616] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 000000005ab3e3a5 xid c05c5798
> 2024-11-15T10:14:43.805798+01:00 fichdc01 kernel: [450695.869270] INFO: task nfsd:1357 blocked for more than 120 seconds.
>
> (crash 3)
> 2024-11-22T09:17:47.855807+01:00 fichdc01 kernel: [224734.495096] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 000000008cac1606 xid b17462f6
> 2024-11-22T09:19:58.535823+01:00 fichdc01 kernel: [224865.170751] INFO: task nfsd:1438 blocked for more than 120 seconds.
>
> (crash 4)
> 2024-11-29T16:06:00.541594+01:00 fichdc02 kernel: [240859.889516] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 000000004d6a097d xid f86a9543
> 2024-11-29T16:06:00.541622+01:00 fichdc02 kernel: [240859.890673] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 000000004d6a097d xid f96a9543
> 2024-11-29T16:09:13.053724+01:00 fichdc02 kernel: [241052.394494] INFO: task nfsd:1733 blocked for more than 120 seconds.
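
Since the "Got unrecognized reply" messages carry an RPC xid and show up
a couple of minutes before the hang, one way to identify the client
behind them might be to keep a rotating packet capture running on the
NFS port and, after the next occurrence, search the capture for that
xid. This is only a rough sketch; the interface name, snap length,
capture path and the assumption that the logged xid matches the rpc.xid
value as Wireshark displays it are mine, so adjust as needed:

  # keep a ring buffer of NFS traffic, headers only, so it can run for days
  tcpdump -i eth0 -s 256 -C 100 -W 20 -w /var/tmp/nfs.pcap port 2049

  # after the next "Got unrecognized reply ... xid c9c4c5ef" message,
  # look up which client address that xid belongs to
  for f in /var/tmp/nfs.pcap*; do
      tshark -r "$f" -Y 'rpc.xid == 0xc9c4c5ef' -T fields -e ip.src -e ip.dst
  done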
> This is followed by 8-10 "nfsd blocked" messages, every 120 seconds:
>
> 2024-11-29T16:09:13.053724+01:00 fichdc02 kernel: [241052.394494] INFO: task nfsd:1733 blocked for more than 120 seconds.
> 2024-11-29T16:09:13.054029+01:00 fichdc02 kernel: [241052.398245] INFO: task nfsd:1734 blocked for more than 120 seconds.
> 2024-11-29T16:09:13.057722+01:00 fichdc02 kernel: [241052.400733] INFO: task nfsd:1735 blocked for more than 120 seconds.
> 2024-11-29T16:11:13.885945+01:00 fichdc02 kernel: [241173.222137] INFO: task nfsd:1732 blocked for more than 120 seconds.
> 2024-11-29T16:11:13.886106+01:00 fichdc02 kernel: [241173.224428] INFO: task nfsd:1733 blocked for more than 241 seconds.
> 2024-11-29T16:11:13.890152+01:00 fichdc02 kernel: [241173.226583] INFO: task nfsd:1734 blocked for more than 241 seconds.
> 2024-11-29T16:11:13.890241+01:00 fichdc02 kernel: [241173.228945] INFO: task nfsd:1735 blocked for more than 241 seconds.
>
> Strangely, sometimes clients that have already opened an NFS session
> can still access the server for about 30 minutes, but no new
> connections are accepted. After 30 minutes everything is blocked, and I
> can't restart the server normally or kill nfsd. I use sysrq to reboot
> the servers immediately...
>
> Sometimes access to the server stops immediately after the "nfsd
> blocked" error messages.
>
> The first "nfsd blocked" message is always related to
> "nfsd4_destroy_session". The following ones are related to
> "nfsd4_destroy_session", "nfsd4_create_session", or
> "nfsd4_shutdown_callback".
>
> Since 6.1.115 I have also had some kworker error messages like this:
>
> INFO: task kworker/u96:2:39983 blocked for more than 120 seconds.
>       Not tainted 6.1.0-28-amd64 #1 Debian 6.1.119-1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u96:2 state:D stack:0 pid:39983 ppid:2 flags:0x00004000
> Workqueue: nfsd4 laundromat_main [nfsd]
> Call Trace:
>  <TASK>
>  __schedule+0x34d/0x9e0
>  schedule+0x5a/0xd0
>  schedule_timeout+0x118/0x150
>  wait_for_completion+0x86/0x160
>  __flush_workqueue+0x152/0x420
>  nfsd4_shutdown_callback+0x49/0x130 [nfsd]
>  ? _raw_spin_unlock+0xa/0x30
>  ? nfsd4_return_all_client_layouts+0xc4/0xf0 [nfsd]
>  ? nfsd4_shutdown_copy+0x28/0x130 [nfsd]
>  __destroy_client+0x1f3/0x290 [nfsd]
>  nfs4_process_client_reaplist+0xa2/0x110 [nfsd]
>  laundromat_main+0x1ce/0x880 [nfsd]
>  process_one_work+0x1c7/0x380
>  worker_thread+0x4d/0x380
>  ? rescuer_thread+0x3a0/0x3a0
>  kthread+0xda/0x100
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x22/0x30
>  </TASK>
>
> task:kworker/u96:3 state:D stack:0 pid:40084 ppid:2 flags:0x00004000
> Workqueue: nfsd4 nfsd4_state_shrinker_worker [nfsd]
> Call Trace:
>  <TASK>
>  __schedule+0x34d/0x9e0
>  schedule+0x5a/0xd0
>  schedule_timeout+0x118/0x150
>  wait_for_completion+0x86/0x160
>  __flush_workqueue+0x152/0x420
>  nfsd4_shutdown_callback+0x49/0x130 [nfsd]
>  ? _raw_spin_unlock+0xa/0x30
>  ? nfsd4_return_all_client_layouts+0xc4/0xf0 [nfsd]
>  ? nfsd4_shutdown_copy+0x28/0x130 [nfsd]
>  __destroy_client+0x1f3/0x290 [nfsd]
>  nfs4_process_client_reaplist+0xa2/0x110 [nfsd]
>  nfsd4_state_shrinker_worker+0xf7/0x320 [nfsd]
>  process_one_work+0x1c7/0x380
>  worker_thread+0x4d/0x380
>  ? rescuer_thread+0x3a0/0x3a0
>  kthread+0xda/0x100
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x22/0x30
>  </TASK>
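
Since the machines end up being rebooted through sysrq anyway, it may be
worth triggering a blocked-task dump first and saving the kernel log, so
that the complete nfsd/kworker stacks at hang time can be attached to
the report. A minimal sketch (sysrq may need to be enabled first, and
"echo t" is very verbose):

  echo 1 > /proc/sys/kernel/sysrq      # enable all sysrq functions if needed
  echo w > /proc/sysrq-trigger         # dump all blocked (D state) tasks
  echo t > /proc/sysrq-trigger         # optionally dump every task's stack
  dmesg -T > /var/tmp/nfsd-hang-$(date +%F-%H%M).log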
> I can do more investigation if someone gives me something to try. Right
> now I have no more ideas, and I don't know how to determine which
> client is causing the "unrecognized reply" messages.
>
> Here are some other links discussing this bug:
>
> https://lore.kernel.org/all/987ec8b2-40da-4745-95c2-8ffef061c...@aixigo.com/T/
> https://forum.openmediavault.org/index.php?thread/52851-nfs-crash/
> https://forum.proxmox.com/threads/kernel-6-8-x-nfs-server-bug.154272/
> https://forums.truenas.com/t/truenas-nfs-random-crash/9200/34
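
Regarding identifying the client behind the "unrecognized reply"
messages: besides a packet capture, the per-client state that nfsd
exports under /proc/fs/nfsd/clients (present in 6.1) lists the address
and identity of every NFSv4 client the server currently tracks, which
might help narrow things down when compared against the logs. A small
sketch, assuming the nfsd filesystem is mounted at its default location:

  # print the info record of each NFSv4 client known to the server
  for c in /proc/fs/nfsd/clients/*; do
      echo "== $c"
      cat "$c/info"
  done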
I followed up upstream in
https://lore.kernel.org/linux-nfs/z2vnq6hxfg_lq...@eldamar.lan/T/#u ;
https://lore.kernel.org/linux-nfs/853bd2973f751e681476d320f23d47332d2bf41a.ca...@kernel.org/
appears to be related as well.

Regards,
Salvatore