Hello.

I have been trying to address this bug without success for 4 months, since I 
upgraded my two file servers to Debian Bookworm in August 2024. Finding a 
solution is critical for me as I manage a high school network where home 
directories are shared over NFS, and at the moment I get a crash every week. 
But my situation may help to find what's wrong, because the crashes occur 
relatively often.

Here is my current investigation.

The last three stable Debian Linux kernels all seem affected by this bug on 
the server side: 6.1.112-1, 6.1.115-1, 6.1.119-1. I have not yet tested any 
earlier Bookworm version. It is difficult for me to give the exact client 
kernel versions, as I have around 450 Debian Bookworm desktops, all configured 
with automatic upgrades, but they may not be rebooted or powered on for a long 
time. And I currently don't know how to determine which client is causing the 
crash.


My two servers have completely different hardware; one is bare metal and the 
other is virtualized. So it does not seem to be a hardware-related problem.

The crash always occurs when there is some load on the servers, but it does 
not require a very high load. Sometimes the problem occurs with very few 
students working (around 75 clients).

Very strangely, in my case, the problem occurs exactly once per week, on one 
server. So I first suspected a log rotation problem, but I didn't find any 
clues in that direction.

Load balancing does not resolve the issue. At first the problem always 
occurred on my "server1". After gradually migrating users to server2, it now 
occurs on "server2".

Very strangely, the problem never occurs twice in a short period of time. This 
makes me think about some memory leak or cache/swapping problem. I will try to 
reboot the servers every day to see if this changes anything.
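
For the daily reboots I will probably just use a root cron entry along these 
lines (only a sketch; the time is an example chosen outside school hours):

  # /etc/cron.d/nightly-reboot -- sketch only, adjust the time as needed
  30 5 * * * root /sbin/shutdown -r now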

I have approximately 40 "receive_cb_reply: Got unrecognized reply: calldir" 
messages per week on each server. These messages do not always lead to a 
crash, but there are always one or two "receive_cb_reply: Got unrecognized 
reply: calldir" messages before a crash. Like this:

(crash 1)
2024-11-07T17:43:33.879937+01:00 fichdc01 kernel: [372607.103736] 
receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
00000000639ae95e xid c9c4c5ef
2024-11-07T17:43:33.879942+01:00 fichdc01 kernel: [372607.103760] 
receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
00000000639ae95e xid c8c4c5ef
2024-11-07T17:46:07.480005+01:00 fichdc01 kernel: [372760.700382] INFO: task 
nfsd:1376 blocked for more than 120 seconds. 

(crash 2)
2024-11-15T10:12:25.053735+01:00 fichdc01 kernel: [450557.120399] 
receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
000000005ab3e3a5 xid bf5c5798
2024-11-15T10:12:25.053755+01:00 fichdc01 kernel: [450557.120616] 
receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
000000005ab3e3a5 xid c05c5798
2024-11-15T10:14:43.805798+01:00 fichdc01 kernel: [450695.869270] INFO: task 
nfsd:1357 blocked for more than 120 seconds.

(crash 3)
2024-11-22T09:17:47.855807+01:00 fichdc01 kernel: [224734.495096] 
receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
000000008cac1606 xid b17462f6
2024-11-22T09:19:58.535823+01:00 fichdc01 kernel: [224865.170751] INFO: task 
nfsd:1438 blocked for more than 120 seconds.

(crash 4)
2024-11-29T16:06:00.541594+01:00 fichdc02 kernel: [240859.889516] 
receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
000000004d6a097d xid f86a9543
2024-11-29T16:06:00.541622+01:00 fichdc02 kernel: [240859.890673] 
receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 
000000004d6a097d xid f96a9543
2024-11-29T16:09:13.053724+01:00 fichdc02 kernel: [241052.394494] INFO: task 
nfsd:1733 blocked for more than 120 seconds.


Then 8-10 "nfsd blocked" messages follow, every 120 seconds:

2024-11-29T16:09:13.053724+01:00 fichdc02 kernel: [241052.394494] INFO: task 
nfsd:1733 blocked for more than 120 seconds.
2024-11-29T16:09:13.054029+01:00 fichdc02 kernel: [241052.398245] INFO: task 
nfsd:1734 blocked for more than 120 seconds.
2024-11-29T16:09:13.057722+01:00 fichdc02 kernel: [241052.400733] INFO: task 
nfsd:1735 blocked for more than 120 seconds.
2024-11-29T16:11:13.885945+01:00 fichdc02 kernel: [241173.222137] INFO: task 
nfsd:1732 blocked for more than 120 seconds.
2024-11-29T16:11:13.886106+01:00 fichdc02 kernel: [241173.224428] INFO: task 
nfsd:1733 blocked for more than 241 seconds.
2024-11-29T16:11:13.890152+01:00 fichdc02 kernel: [241173.226583] INFO: task 
nfsd:1734 blocked for more than 241 seconds.
2024-11-29T16:11:13.890241+01:00 fichdc02 kernel: [241173.228945] INFO: task 
nfsd:1735 blocked for more than 241 seconds. 

Strangely, sometimes clients that have already opened an NFS session can still 
access the server for about 30 minutes, but no new connections are allowed. 
After 30 minutes everything is blocked. I can't restart the server normally or 
kill nfsd, so I use sysrq to reboot the servers immediately...
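
Since a clean shutdown is impossible at that point, the emergency reboot goes 
through the kernel's sysrq interface, roughly like this (or the equivalent 
Alt+SysRq key sequence; sysrq must be enabled first):

  echo 1 > /proc/sys/kernel/sysrq     # allow all sysrq commands
  echo s > /proc/sysrq-trigger        # emergency sync
  echo u > /proc/sysrq-trigger        # remount filesystems read-only
  echo b > /proc/sysrq-trigger        # reboot immediately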

Sometimes access to the server stops immediately after the "nfsd blocked" 
error messages.

The first "nfsd blocked" message is always related to "nfsd4_destroy_session". 
The following ones are related to "nfsd4_destroy_session", 
"nfsd4_create_session", or "nfsd4_shutdown_callback".

Since 6.1.115 I also have some kworker error messages like this:

INFO: task kworker/u96:2:39983 blocked for more than 120 seconds.
Not tainted 6.1.0-28-amd64 #1 Debian 6.1.119-1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u96:2 state:D stack:0 pid:39983 ppid:2 flags:0x00004000
Workqueue: nfsd4 laundromat_main [nfsd]
Call Trace:
<TASK>
__schedule+0x34d/0x9e0
schedule+0x5a/0xd0
schedule_timeout+0x118/0x150
wait_for_completion+0x86/0x160
__flush_workqueue+0x152/0x420
nfsd4_shutdown_callback+0x49/0x130 [nfsd]
? _raw_spin_unlock+0xa/0x30
? nfsd4_return_all_client_layouts+0xc4/0xf0 [nfsd]
? nfsd4_shutdown_copy+0x28/0x130 [nfsd]
__destroy_client+0x1f3/0x290 [nfsd]
nfs4_process_client_reaplist+0xa2/0x110 [nfsd]
laundromat_main+0x1ce/0x880 [nfsd]
process_one_work+0x1c7/0x380
worker_thread+0x4d/0x380
? rescuer_thread+0x3a0/0x3a0
kthread+0xda/0x100
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>

task:kworker/u96:3 state:D stack:0 pid:40084 ppid:2 flags:0x00004000
Workqueue: nfsd4 nfsd4_state_shrinker_worker [nfsd]
Call Trace:
<TASK>
__schedule+0x34d/0x9e0
schedule+0x5a/0xd0
schedule_timeout+0x118/0x150
wait_for_completion+0x86/0x160
__flush_workqueue+0x152/0x420
nfsd4_shutdown_callback+0x49/0x130 [nfsd]
? _raw_spin_unlock+0xa/0x30
? nfsd4_return_all_client_layouts+0xc4/0xf0 [nfsd]
? nfsd4_shutdown_copy+0x28/0x130 [nfsd]
__destroy_client+0x1f3/0x290 [nfsd]
nfs4_process_client_reaplist+0xa2/0x110 [nfsd]
nfsd4_state_shrinker_worker+0xf7/0x320 [nfsd]
process_one_work+0x1c7/0x380
worker_thread+0x4d/0x380
? rescuer_thread+0x3a0/0x3a0
kthread+0xda/0x100
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>

I can do more investigation if someone gives me something to try. At the 
moment I have no more ideas, and I don't know how to determine which client is 
causing the "unrecognized reply" messages.

Here are some other links discussing this bug:

https://lore.kernel.org/all/987ec8b2-40da-4745-95c2-8ffef061c...@aixigo.com/T/
https://forum.openmediavault.org/index.php?thread/52851-nfs-crash/
https://forum.proxmox.com/threads/kernel-6-8-x-nfs-server-bug.154272/
https://forums.truenas.com/t/truenas-nfs-random-crash/9200/34

Regards,

Baptiste.
