Hi all,

On Mon, Dec 02, 2024 at 10:25:09AM +0100, Pellegrin Baptiste wrote:
> Hello.
> 
> I have been trying to address this bug without success for four months, 
> since I upgraded my two file servers to Debian Bookworm in August 2024. 
> Finding a solution is critical for me, as I manage a high school network 
> where home directories are shared over NFS and I currently have a crash 
> every week. But my situation may help to find what's wrong, because the 
> crash occurs relatively often.
> 
> Here is my current investigation.
> 
> The last three stable Debian Linux kernels all seem to be affected by this 
> bug on the server side: 6.1.112-1, 6.1.115-1, 6.1.119-1. I have not yet 
> tested any earlier Bookworm version. It is difficult for me to give the 
> exact client kernel version, as I have around 450 Debian Bookworm desktops, 
> all configured with automatic upgrades, but they may not have been 
> rebooted/powered on for a long time. And I currently don't know how to 
> determine which client is causing the crash.
> 
> 
> My two servers have completely different hardware; one is bare metal and 
> the other is virtualized. So it does not seem to be a hardware-related 
> problem.
> 
> The crash always occurs when there is some load on the servers, but it 
> does not take very high load. Sometimes the problem occurs with very few 
> students working (around 75 clients).
> 
> Very strangely, in my case, the problem occurs exactly once per week, on 
> one server. So I first thought of a log rotation problem, but I didn't 
> find any clues in that direction.
> 
> Load balancing doesn't resolve the issue. At first the problem always 
> occurred on my "server1". After gradually migrating users to server2, it 
> now occurs on "server2".
> 
> Very strangely, the problem never occurs twice in a short period of time. 
> This makes me think of some memory leak or cache/swapping problem. I will 
> try to reboot the servers every day to see if that changes anything.
> 
> I have approximately 40 "receive_cb_reply: Got unrecognized reply: calldir" 
> messages per week on each server. These messages do not always lead to the 
> crash, but there are always one or two "receive_cb_reply: Got unrecognized 
> reply: calldir" messages before the crash. Like this:
> 
> (crash 1)
> 2024-11-07T17:43:33.879937+01:00 fichdc01 kernel: [372607.103736] 
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
> 00000000639ae95e xid c9c4c5ef
> 2024-11-07T17:43:33.879942+01:00 fichdc01 kernel: [372607.103760] 
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
> 00000000639ae95e xid c8c4c5ef
> 2024-11-07T17:46:07.480005+01:00 fichdc01 kernel: [372760.700382] INFO: task 
> nfsd:1376 blocked for more than 120 seconds. 
> 
> (crash 2)
> 2024-11-15T10:12:25.053735+01:00 fichdc01 kernel: [450557.120399] 
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
> 000000005ab3e3a5 xid bf5c5798
> 2024-11-15T10:12:25.053755+01:00 fichdc01 kernel: [450557.120616] 
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
> 000000005ab3e3a5 xid c05c5798
> 2024-11-15T10:14:43.805798+01:00 fichdc01 kernel: [450695.869270] INFO: task 
> nfsd:1357 blocked for more than 120 seconds.
> 
> (crash 3)
> 2024-11-22T09:17:47.855807+01:00 fichdc01 kernel: [224734.495096] 
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
> 000000008cac1606 xid b17462f6
> 2024-11-22T09:19:58.535823+01:00 fichdc01 kernel: [224865.170751] INFO: task 
> nfsd:1438 blocked for more than 120 seconds.
> 
> (crash 4)
> 2024-11-29T16:06:00.541594+01:00 fichdc02 kernel: [240859.889516] 
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
> 000000004d6a097d xid f86a9543
> 2024-11-29T16:06:00.541622+01:00 fichdc02 kernel: [240859.890673] 
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
> 000000004d6a097d xid f96a9543
> 2024-11-29T16:09:13.053724+01:00 fichdc02 kernel: [241052.394494] INFO: task 
> nfsd:1733 blocked for more than 120 seconds.
> 
> 
> Then 8-10 "nfsd blocked" messages follow, every 120 seconds:
> 
> 2024-11-29T16:09:13.053724+01:00 fichdc02 kernel: [241052.394494] INFO: task 
> nfsd:1733 blocked for more than 120 seconds.
> 2024-11-29T16:09:13.054029+01:00 fichdc02 kernel: [241052.398245] INFO: task 
> nfsd:1734 blocked for more than 120 seconds.
> 2024-11-29T16:09:13.057722+01:00 fichdc02 kernel: [241052.400733] INFO: task 
> nfsd:1735 blocked for more than 120 seconds.
> 2024-11-29T16:11:13.885945+01:00 fichdc02 kernel: [241173.222137] INFO: task 
> nfsd:1732 blocked for more than 120 seconds.
> 2024-11-29T16:11:13.886106+01:00 fichdc02 kernel: [241173.224428] INFO: task 
> nfsd:1733 blocked for more than 241 seconds.
> 2024-11-29T16:11:13.890152+01:00 fichdc02 kernel: [241173.226583] INFO: task 
> nfsd:1734 blocked for more than 241 seconds.
> 2024-11-29T16:11:13.890241+01:00 fichdc02 kernel: [241173.228945] INFO: task 
> nfsd:1735 blocked for more than 241 seconds. 
> 
> Strangely, sometimes clients that have already opened an NFS session can 
> still access the server for about 30 minutes, but no new connections are 
> allowed. After 30 minutes everything is blocked, and I can't restart the 
> server normally or kill nfsd. I use sysrq to reboot the servers 
> immediately...
> 
> Sometimes access to the server stops immediately after the "nfsd blocked" 
> error messages.
> 
> The first "nfsd blocked message" is always related to 
> "nfsd4_destroy_session". The follow are related to "nfsd4_destroy_session", 
> "nfsd4_create_session", or "nfsd4_shutdown_callback".
> 
> Since 6.1.115 I also have some kworker error messages like this:
> 
> INFO: task kworker/u96:2:39983 blocked for more than 120 seconds.
> Not tainted 6.1.0-28-amd64 #1 Debian 6.1.119-1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u96:2 state:D stack:0 pid:39983 ppid:2 flags:0x00004000
> Workqueue: nfsd4 laundromat_main [nfsd]
> Call Trace:
> <TASK>
> __schedule+0x34d/0x9e0
> schedule+0x5a/0xd0
> schedule_timeout+0x118/0x150
> wait_for_completion+0x86/0x160
> __flush_workqueue+0x152/0x420
> nfsd4_shutdown_callback+0x49/0x130 [nfsd]
> ? _raw_spin_unlock+0xa/0x30
> ? nfsd4_return_all_client_layouts+0xc4/0xf0 [nfsd]
> ? nfsd4_shutdown_copy+0x28/0x130 [nfsd]
> __destroy_client+0x1f3/0x290 [nfsd]
> nfs4_process_client_reaplist+0xa2/0x110 [nfsd]
> laundromat_main+0x1ce/0x880 [nfsd]
> process_one_work+0x1c7/0x380
> worker_thread+0x4d/0x380
> ? rescuer_thread+0x3a0/0x3a0
> kthread+0xda/0x100
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x22/0x30
> </TASK>
> 
> task:kworker/u96:3 state:D stack:0 pid:40084 ppid:2 flags:0x00004000
> Workqueue: nfsd4 nfsd4_state_shrinker_worker [nfsd]
> Call Trace:
> <TASK>
> __schedule+0x34d/0x9e0
> schedule+0x5a/0xd0
> schedule_timeout+0x118/0x150
> wait_for_completion+0x86/0x160
> __flush_workqueue+0x152/0x420
> nfsd4_shutdown_callback+0x49/0x130 [nfsd]
> ? _raw_spin_unlock+0xa/0x30
> ? nfsd4_return_all_client_layouts+0xc4/0xf0 [nfsd]
> ? nfsd4_shutdown_copy+0x28/0x130 [nfsd]
> __destroy_client+0x1f3/0x290 [nfsd]
> nfs4_process_client_reaplist+0xa2/0x110 [nfsd]
> nfsd4_state_shrinker_worker+0xf7/0x320 [nfsd]
> process_one_work+0x1c7/0x380
> worker_thread+0x4d/0x380
> ? rescuer_thread+0x3a0/0x3a0
> kthread+0xda/0x100
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x22/0x30
> </TASK>
> 
> I can do more investigation if someone gives me something to try. Right 
> now I have no more ideas, and I don't know how to determine the client 
> causing the "unrecognized reply" messages.
> 
> Here are some other links talking about this bug:
> 
> https://lore.kernel.org/all/987ec8b2-40da-4745-95c2-8ffef061c...@aixigo.com/T/
> https://forum.openmediavault.org/index.php?thread/52851-nfs-crash/
> https://forum.proxmox.com/threads/kernel-6-8-x-nfs-server-bug.154272/
> https://forums.truenas.com/t/truenas-nfs-random-crash/9200/34

I followed up upstream in
https://lore.kernel.org/linux-nfs/z2vnq6hxfg_lq...@eldamar.lan/T/#u
and there also seems to be a related report in
https://lore.kernel.org/linux-nfs/853bd2973f751e681476d320f23d47332d2bf41a.ca...@kernel.org/
.
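
Regarding the question of how to determine which client triggers the
"receive_cb_reply: Got unrecognized reply" messages: one possible approach
(untested here; it assumes the xid printed by the kernel matches the RPC XID
seen on the wire, and that eth0 is the NFS-facing interface) would be to keep
a rolling packet capture of NFS traffic and search it for the logged xid
after a crash, for example:

  # ring buffer of 10 files of ~100 MB each, NFS traffic only
  tcpdump -i eth0 -s 0 -C 100 -W 10 -w /var/tmp/nfs-ring.pcap port 2049

  # after a "Got unrecognized reply ... xid c9c4c5ef" message, look the xid
  # up in the relevant capture file; the matching packets carry the client's
  # IP address
  tshark -r <capture file> -Y 'rpc.xid == 0xc9c4c5ef'

Whether the logged xid and Wireshark's rpc.xid field use the same byte order
is worth double-checking against a known request first.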

Regards,
Salvatore
