We installed the unofficial kernel 6.8.0-46-generic-nfs on several NFS client
servers on Saturday and have been testing it under high I/O load since then.
Unfortunately, the NFS server crashed again after about 40 hours with
"rcu: INFO: rcu_sched self-detected stall on CPU".
The 6.8.0-46-generic-nfs kernel prevents the error message
"RPC: Could not send backchannel reply error: -110", but not the crashes that
we have been struggling with since August 19th, when we switched the kernel
from 6.5.0-44-generic to 6.8.0-40-generic.

Our experience with the NFS server crashes so far:
- We were able to reproduce the crashes in our production and test
  environments: rarely after minutes, sometimes after hours or days, and
  sometimes not at all, since we often stopped the experiments after 12 to 24
  hours.
- We have not yet been able to reproduce a crash between a bare-metal NFS
  server and a bare-metal NFS client, but we have reproduced it between a
  bare-metal NFS server and a virtualized client.
- We could not reproduce a crash with NFS vers=4.0.
- The crashes happen with or without GSSPROXY.

Our setup:
- virtualized NFS 4.2 server with Ubuntu 22.04.5 LTS and kernel 5.15.0-122-generic
- virtualized NFS client with Ubuntu 22.04.5 LTS and kernel 6.8.0-40-generic or 6.8.0-45-generic
- /etc/exports :  /mnt/home  nfsclient(sec=krb5,rw,root_squash,sync,no_subtree_check)
- /etc/fstab :  nfsserver:/mnt/home /home   nfs    vers=4.2,rw,soft,sec=krb5,proto=tcp  0  0
- apt info nfs-common : Version: 1:2.6.1-1ubuntu1.2
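
Since we could not reproduce a crash with vers=4.0, it may be useful to check
which NFS version a client's fstab entries request. The following is only a
minimal Python sketch and not something we ran in our tests: the function
names are hypothetical, and it assumes the plain whitespace-separated fstab
layout shown above with an explicit vers= (or nfsvers=) option.

```python
def nfs_version_option(fstab_line):
    """Return the value of vers=/nfsvers= from an fstab NFS entry, or None.

    Assumes the usual layout: device, mount point, fs type, options, dump, pass.
    """
    fields = fstab_line.split()
    # Need at least the options column, and the fs type must be nfs or nfs4.
    if len(fields) < 4 or not fields[2].startswith("nfs"):
        return None
    for option in fields[3].split(","):
        key, _, value = option.partition("=")
        if key in ("vers", "nfsvers") and value:
            return value
    return None


def pinned_to_v40(fstab_line):
    """True only if the entry explicitly requests NFS version 4.0."""
    return nfs_version_option(fstab_line) == "4.0"


if __name__ == "__main__":
    entry = "nfsserver:/mnt/home /home nfs vers=4.2,rw,soft,sec=krb5,proto=tcp 0 0"
    print(nfs_version_option(entry), pinned_to_v40(entry))
```

An entry without an explicit version option returns None here, because the
version is then negotiated at mount time and cannot be read from fstab alone.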

syslog of NFS server after crash:
Sep 30 01:15:51 nfs-server.domain.de kernel: rcu: INFO: rcu_sched self-detected 
stall on CPU
Sep 30 01:15:51 nfs-server.domain.de kernel: rcu:         54-....: (14998 ticks 
this GP) idle=2db/1/0x4000000000000000 softirq=32173387/32173387 fqs=7449
Sep 30 01:15:51 nfs-server.domain.de kernel:         (t=15000 jiffies 
g=144775177 q=49782)
Sep 30 01:15:51 nfs-server.domain.de kernel: NMI backtrace for cpu 54
Sep 30 01:15:51 nfs-server.domain.de kernel: CPU: 54 PID: 153154 Comm: 
kworker/u480:36 Not tainted 5.15.0-122-generic #132-Ubuntu
Sep 30 01:15:51 nfs-server.domain.de kernel: Hardware name: Microsoft 
Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 
12/17/2019
Sep 30 01:15:51 nfs-server.domain.de kernel: Workqueue: rpciod 
rpc_async_schedule [sunrpc]
Sep 30 01:15:51 nfs-server.domain.de kernel: Call Trace:
Sep 30 01:15:51 nfs-server.domain.de kernel:  <IRQ>
Sep 30 01:15:51 nfs-server.domain.de kernel:  show_stack+0x52/0x5c
Sep 30 01:15:51 nfs-server.domain.de kernel:  dump_stack_lvl+0x4a/0x63
Sep 30 01:15:51 nfs-server.domain.de kernel:  dump_stack+0x10/0x16
Sep 30 01:15:51 nfs-server.domain.de kernel:  nmi_cpu_backtrace.cold+0x4d/0x93
Sep 30 01:15:51 nfs-server.domain.de kernel:  ? lapic_can_unplug_cpu+0x90/0x90
Sep 30 01:15:51 nfs-server.domain.de kernel:  
nmi_trigger_cpumask_backtrace+0xec/0x100
Sep 30 01:15:51 nfs-server.domain.de kernel:  
arch_trigger_cpumask_backtrace+0x19/0x20
Sep 30 01:15:51 nfs-server.domain.de kernel:  
trigger_single_cpu_backtrace+0x44/0x4f
Sep 30 01:15:51 nfs-server.domain.de kernel:  rcu_dump_cpu_stacks+0x102/0x149
Sep 30 01:15:51 nfs-server.domain.de kernel:  print_cpu_stall.cold+0x2f/0xe2
Sep 30 01:15:51 nfs-server.domain.de kernel:  check_cpu_stall+0x1d8/0x270
Sep 30 01:15:51 nfs-server.domain.de kernel:  rcu_sched_clock_irq+0x9a/0x250
Sep 30 01:15:51 nfs-server.domain.de kernel:  update_process_times+0x94/0xd0
Sep 30 01:15:51 nfs-server.domain.de kernel:  tick_sched_handle+0x29/0x70
Sep 30 01:15:51 nfs-server.domain.de kernel:  tick_sched_timer+0x6f/0x90
Sep 30 01:15:51 nfs-server.domain.de kernel:  ? tick_sched_do_timer+0xa0/0xa0
Sep 30 01:15:51 nfs-server.domain.de kernel:  __hrtimer_run_queues+0x104/0x230
Sep 30 01:15:51 nfs-server.domain.de kernel:  ? read_hv_clock_tsc_cs+0x9/0x30
Sep 30 01:15:51 nfs-server.domain.de kernel:  hrtimer_interrupt+0x101/0x220
Sep 30 01:15:51 nfs-server.domain.de kernel:  hv_stimer0_isr+0x1d/0x30
Sep 30 01:15:51 nfs-server.domain.de kernel:  __sysvec_hyperv_stimer0+0x2f/0x70
Sep 30 01:15:51 nfs-server.domain.de kernel:  sysvec_hyperv_stimer0+0x7b/0x90
Sep 30 01:15:51 nfs-server.domain.de kernel:  </IRQ>
Sep 30 01:15:51 nfs-server.domain.de kernel:  <TASK>
Sep 30 01:15:51 nfs-server.domain.de kernel:  
asm_sysvec_hyperv_stimer0+0x1b/0x20
Sep 30 01:15:51 nfs-server.domain.de kernel: RIP: 
0010:read_hv_clock_tsc+0x1b/0x60
Sep 30 01:15:51 nfs-server.domain.de kernel: Code: eb bc 66 66 2e 0f 1f 84 00 
00 00 00 00 66 90 8b 35 2a 89 97 02 85 f6 74 38 4c 8b 05 27 89 97 02 48 8b 3d 
28 89 97 02 0f 01 f9 <66> 90 8b 0d 0d 89 97 02 39 ce 75 d9 48 c1 e2 20 48 09 d0 
49 f7 e0
Sep 30 01:15:51 nfs-server.domain.de kernel: RSP: 0018:ffffada44ab33dc8 EFLAGS: 
00000202
Sep 30 01:15:51 nfs-server.domain.de kernel: RAX: 000000005d52dc50 RBX: 
0002197146e8f7ec RCX: 0000000000000036
Sep 30 01:15:51 nfs-server.domain.de kernel: RDX: 00000000000571f0 RSI: 
0000000000000002 RDI: 000000000000000a
Sep 30 01:15:51 nfs-server.domain.de kernel: RBP: ffffada44ab33dd0 R08: 
00fca74eaf6bde68 R09: ffffffffc06265c8
Sep 30 01:15:51 nfs-server.domain.de kernel: R10: 0000000000000003 R11: 
ffff97e3daffe358 R12: 0000000000000000
Sep 30 01:15:51 nfs-server.domain.de kernel: R13: 000000000f685174 R14: 
ffff97e544039d30 R15: 0000000000000001
Sep 30 01:15:51 nfs-server.domain.de kernel:  ? read_hv_clock_tsc_cs+0x9/0x30
Sep 30 01:15:51 nfs-server.domain.de kernel:  ktime_get+0x43/0xc0
Sep 30 01:15:51 nfs-server.domain.de kernel:  rpc_exit_task+0x95/0x110 [sunrpc]
Sep 30 01:15:51 nfs-server.domain.de kernel:  ? 
__rpc_sleep_on_priority+0x80/0x80 [sunrpc]
Sep 30 01:15:51 nfs-server.domain.de kernel:  __rpc_execute+0x65/0x270 [sunrpc]
Sep 30 01:15:51 nfs-server.domain.de kernel:  rpc_async_schedule+0x30/0x50 
[sunrpc]
Sep 30 01:15:51 nfs-server.domain.de kernel:  process_one_work+0x228/0x3d0
Sep 30 01:15:51 nfs-server.domain.de kernel:  worker_thread+0x53/0x420
Sep 30 01:15:51 nfs-server.domain.de kernel:  ? process_one_work+0x3d0/0x3d0
Sep 30 01:15:51 nfs-server.domain.de kernel:  kthread+0x127/0x150
Sep 30 01:15:51 nfs-server.domain.de kernel:  ? set_kthread_struct+0x50/0x50
Sep 30 01:15:51 nfs-server.domain.de kernel:  ret_from_fork+0x1f/0x30
Sep 30 01:15:51 nfs-server.domain.de kernel:  </TASK>
Sep 30 01:17:14 nfs-server.domain.de kernel: watchdog: BUG: soft lockup - 
CPU#54 stuck for 134s! [kworker/u480:36:153154]
Sep 30 01:17:14 nfs-server.domain.de kernel: Modules linked in: tls 
rpcsec_gss_krb5 nfsv4 nfs fscache netfs binfmt_misc xfs nls_iso8859_1 
intel_rapl_msr intel_rapl_common nfit serio_raw hyperv_fb rapl hv_balloon 
joydev mac_hid sch_fq_codel nfsd dm_multipath scsi_dh_rdac scsi_dh_emc 
auth_rpcgss scsi_dh_alua nfs_acl lockd grace msr efi_pstore sunrpc ip_tables 
x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 
async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
libcrc32c raid1 raid0 multipath linear hyperv_drm drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops cec hid_generic rc_core hid_hyperv hv_storvsc 
drm scsi_transport_fc hv_netvsc hid hyperv_keyboard hv_utils crct10dif_pclmul 
crc32_pclmul ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel 
crypto_simd cryptd hv_vmbus
Sep 30 01:17:14 nfs-server.domain.de kernel: CPU: 54 PID: 153154 Comm: 
kworker/u480:36 Not tainted 5.15.0-122-generic #132-Ubuntu
Sep 30 01:17:14 nfs-server.domain.de kernel: Hardware name: Microsoft 
Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 
12/17/2019
Sep 30 01:17:14 nfs-server.domain.de kernel: Workqueue: rpciod 
rpc_async_schedule [sunrpc]
Sep 30 01:17:14 nfs-server.domain.de kernel: RIP: 0010:_raw_spin_lock+0x10/0x30
Sep 30 01:17:14 nfs-server.domain.de kernel: Code: 89 e5 e8 13 63 36 ff 66 90 
5d c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 c0 ba 01 
00 00 00 f0 0f b1 17 <75> 05 c3 cc cc cc cc 55 89 c6 48 89 e5 e8 de 62 36 ff 66 
90 5d c3
Sep 30 01:17:14 nfs-server.domain.de kernel: RSP: 0018:ffffada44ab33e20 EFLAGS: 
00000246
Sep 30 01:17:14 nfs-server.domain.de kernel: RAX: 0000000000000000 RBX: 
ffffffffc05da910 RCX: 0000000000000001
Sep 30 01:17:14 nfs-server.domain.de kernel: RDX: 0000000000000001 RSI: 
ffff97e544039d00 RDI: ffffffffc0626540
Sep 30 01:17:14 nfs-server.domain.de kernel: RBP: ffffada44ab33e50 R08: 
0000000000000001 R09: ffffffffc06265c8
Sep 30 01:17:14 nfs-server.domain.de kernel: R10: 0000000000000003 R11: 
ffff97e3daffe358 R12: ffff97e544039d00
Sep 30 01:17:14 nfs-server.domain.de kernel: R13: ffffffffc0626540 R14: 
ffff97e544039d30 R15: 0000000000000001
Sep 30 01:17:14 nfs-server.domain.de kernel: FS:  0000000000000000(0000) 
GS:ffff9818ba380000(0000) knlGS:0000000000000000
Sep 30 01:17:14 nfs-server.domain.de kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Sep 30 01:17:14 nfs-server.domain.de kernel: CR2: 00007fd9ae6b0240 CR3: 
00000001086fc003 CR4: 00000000003706e0
Sep 30 01:17:14 nfs-server.domain.de kernel: DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
Sep 30 01:17:14 nfs-server.domain.de kernel: DR3: 0000000000000000 DR6: 
00000000fffe0ff0 DR7: 0000000000000400
Sep 30 01:17:14 nfs-server.domain.de kernel: Call Trace:
Sep 30 01:17:14 nfs-server.domain.de kernel:  <IRQ>
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? show_trace_log_lvl+0x1d6/0x2ea
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? show_trace_log_lvl+0x1d6/0x2ea
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? rpc_async_schedule+0x30/0x50 
[sunrpc]
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? show_regs.part.0+0x23/0x29
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? show_regs.cold+0x8/0xd
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? watchdog_timer_fn+0x1be/0x220
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? 
lockup_detector_update_enable+0x60/0x60
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? __hrtimer_run_queues+0x104/0x230
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? read_hv_clock_tsc_cs+0x9/0x30
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? hrtimer_interrupt+0x101/0x220
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? hv_stimer0_isr+0x1d/0x30
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? 
__sysvec_hyperv_stimer0+0x2f/0x70
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? sysvec_hyperv_stimer0+0x7b/0x90
Sep 30 01:17:14 nfs-server.domain.de kernel:  </IRQ>
Sep 30 01:17:14 nfs-server.domain.de kernel:  <TASK>
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? 
asm_sysvec_hyperv_stimer0+0x1b/0x20
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? 
__rpc_sleep_on_priority+0x80/0x80 [sunrpc]
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? _raw_spin_lock+0x10/0x30
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? __rpc_execute+0x8b/0x270 
[sunrpc]
Sep 30 01:17:14 nfs-server.domain.de kernel:  rpc_async_schedule+0x30/0x50 
[sunrpc]
Sep 30 01:17:14 nfs-server.domain.de kernel:  process_one_work+0x228/0x3d0
Sep 30 01:17:14 nfs-server.domain.de kernel:  worker_thread+0x53/0x420
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? process_one_work+0x3d0/0x3d0
Sep 30 01:17:14 nfs-server.domain.de kernel:  kthread+0x127/0x150
Sep 30 01:17:14 nfs-server.domain.de kernel:  ? set_kthread_struct+0x50/0x50
Sep 30 01:17:14 nfs-server.domain.de kernel:  ret_from_fork+0x1f/0x30
Sep 30 01:17:14 nfs-server.domain.de kernel:  </TASK>
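
To compare machines, the messages quoted in this report can be counted in a
log file. This is a rough Python sketch under stated assumptions: the
signature names and the function name are made up for illustration, the
patterns are copied from the messages in this bug, and the input must be the
unwrapped log lines as they appear in syslog or dmesg (not the hard-wrapped
copy in this mail).

```python
import re
from collections import Counter

# Patterns taken from the kernel messages quoted in this bug report.
# The dictionary keys are illustrative labels, not kernel terminology.
SIGNATURES = {
    "rcu_stall": re.compile(r"rcu: INFO: rcu_sched self-detected stall on CPU"),
    "soft_lockup": re.compile(r"watchdog: BUG: soft lockup - CPU#\d+ stuck for \d+s"),
    "backchannel_timeout": re.compile(
        r"RPC: Could not send backchannel reply error: -110"
    ),
    "hung_nfsd": re.compile(r"INFO: task nfsd:\d+ blocked for more than \d+ seconds"),
}


def count_signatures(lines):
    """Count how many lines match each known signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py < /var/log/syslog
    for name, count in sorted(count_signatures(sys.stdin).items()):
        print(f"{name}: {count}")
```

The error code -110 in the backchannel message corresponds to ETIMEDOUT on
Linux, which is why that label uses the word "timeout".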

There seem to be more problems with the NFS backchannel at the moment: 
https://lore.kernel.org/linux-nfs/?q=backchannel 
https://access.redhat.com/solutions/7000130

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2062568

Title:
  nfsd gets unresponsive after some hours of operation

Status in linux package in Ubuntu:
  In Progress
Status in nfs-utils package in Ubuntu:
  Confirmed

Bug description:
  I installed the 24.04 Beta on two test machines that were running
  22.04 without issues before. One of them exports two volumes that are
  mounted by the other machine, which primarily uses them as a secondary
  storage for ccache.

  After being up for a couple of hours (happened twice since yesterday
  evening) it seems that nfsd on the machine exporting the volumes hangs
  on something.

  From dmesg on the server (repeated a few times):

  [11183.290548] INFO: task nfsd:1419 blocked for more than 1228 seconds.
  [11183.290558]       Not tainted 6.8.0-22-generic #22-Ubuntu
  [11183.290563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
  [11183.290582] task:nfsd            state:D stack:0     pid:1419  tgid:1419  
ppid:2      flags:0x00004000
  [11183.290587] Call Trace:
  [11183.290602]  <TASK>
  [11183.290606]  __schedule+0x27c/0x6b0
  [11183.290612]  schedule+0x33/0x110
  [11183.290615]  schedule_timeout+0x157/0x170
  [11183.290619]  wait_for_completion+0x88/0x150
  [11183.290623]  __flush_workqueue+0x140/0x3e0
  [11183.290629]  nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
  [11183.290689]  nfsd4_destroy_session+0x186/0x260 [nfsd]
  [11183.290744]  nfsd4_proc_compound+0x3af/0x770 [nfsd]
  [11183.290798]  nfsd_dispatch+0xd4/0x220 [nfsd]
  [11183.290851]  svc_process_common+0x44d/0x710 [sunrpc]
  [11183.290924]  ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
  [11183.290976]  svc_process+0x132/0x1b0 [sunrpc]
  [11183.291041]  svc_handle_xprt+0x4d3/0x5d0 [sunrpc]
  [11183.291105]  svc_recv+0x18b/0x2e0 [sunrpc]
  [11183.291168]  ? __pfx_nfsd+0x10/0x10 [nfsd]
  [11183.291220]  nfsd+0x8b/0xe0 [nfsd]
  [11183.291270]  kthread+0xef/0x120
  [11183.291274]  ? __pfx_kthread+0x10/0x10
  [11183.291276]  ret_from_fork+0x44/0x70
  [11183.291279]  ? __pfx_kthread+0x10/0x10
  [11183.291281]  ret_from_fork_asm+0x1b/0x30
  [11183.291286]  </TASK>

  From dmesg on the client (repeated a number of times):
  [ 6596.911785] RPC: Could not send backchannel reply error: -110
  [ 6596.972490] RPC: Could not send backchannel reply error: -110
  [ 6837.281307] RPC: Could not send backchannel reply error: -110

  ProblemType: Bug
  DistroRelease: Ubuntu 24.04
  Package: nfs-kernel-server 1:2.6.4-3ubuntu5
  ProcVersionSignature: Ubuntu 6.8.0-22.22-generic 6.8.1
  Uname: Linux 6.8.0-22-generic x86_64
  .etc.request-key.d.id_resolver.conf: create   id_resolver     *       *       
/usr/sbin/nfsidmap -t 600 %k %d
  ApportVersion: 2.28.1-0ubuntu1
  Architecture: amd64
  CasperMD5CheckResult: pass
  Date: Fri Apr 19 14:10:25 2024
  InstallationDate: Installed on 2024-04-16 (3 days ago)
  InstallationMedia: Ubuntu-Server 24.04 LTS "Noble Numbat" - Beta amd64 
(20240410.1)
  NFSMounts:

  NFSv4Mounts:

  ProcEnviron:
   LANG=en_US.UTF-8
   PATH=(custom, no user)
   SHELL=/bin/bash
   TERM=xterm-256color
   XDG_RUNTIME_DIR=<set>
  SourcePackage: nfs-utils
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2062568/+subscriptions


