We installed the unofficial kernel 6.8.0-46-generic-nfs on several NFS client servers on Saturday and have been testing it under high I/O load since then. Unfortunately, the server crashed again after about 40 hours with "rcu: INFO: rcu_sched self-detected stall on CPU". The kernel 6.8.0-46-generic-nfs prevents the error message "RPC: Could not send backchannel reply error: -110", but not the crashes, which we have been struggling with since August 19th, when we switched the kernel from 6.5.0-44-generic to 6.8.0-40-generic.
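To compare test runs, the kernel messages mentioned in this mail can be counted in a saved syslog or dmesg dump. The following is only a rough sketch: the function name count_symptoms and the labels in PATTERNS are made up for illustration, and the patterns match only the exact message texts quoted in this mail, so other kernels may word them differently.

```python
import re
from collections import Counter

# Message texts copied from the logs quoted in this mail.
# The dictionary keys are illustrative labels, not kernel terminology.
PATTERNS = {
    "rcu_stall": re.compile(r"rcu: INFO: rcu_sched self-detected stall on CPU"),
    "soft_lockup": re.compile(r"watchdog: BUG: soft lockup - CPU#\d+ stuck for \d+s"),
    "backchannel_timeout": re.compile(
        r"RPC: Could not send backchannel reply error: -110"
    ),
    "hung_task": re.compile(r"INFO: task \S+ blocked for more than \d+ seconds"),
}


def count_symptoms(lines):
    """Count how many log lines match each known symptom pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

The function accepts any iterable of lines, so an open file object for a saved log (for example a copy of /var/log/syslog) can be passed in directly.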
Our experiences with NFS server crashes are:
- We were able to reproduce the crashes in our production and test environments. Rarely after minutes, sometimes after hours or days, but sometimes not at all, as we often stopped the experiments after 12 to 24 hours.
- We have not yet been able to reproduce a crash between a bare metal NFS server and a bare metal NFS client, but we have reproduced one between a bare metal NFS server and a virtualized client.
- We could not reproduce a crash with NFS vers=4.0.
- The crashes happen with or without GSSPROXY.

Our setup:
- virtualized NFS 4.2 server with Ubuntu 22.04.5 LTS and kernel 5.15.0-122-generic
- virtualized NFS client with Ubuntu 22.04.5 LTS and kernel 6.8.0-40-generic or kernel 6.8.0-45-generic
- /etc/exports : /mnt/home nfsclient(sec=krb5,rw,root_squash,sync,no_subtree_check)
- /etc/fstab : nfsserver:/mnt/home /home nfs vers=4.2,rw,soft,sec=krb5,proto=tcp 0 0
- apt info nfs-common : Version: 1:2.6.1-1ubuntu1.2

syslog of NFS server after crash:
Sep 30 01:15:51 nfs-server.domain.de kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Sep 30 01:15:51 nfs-server.domain.de kernel: rcu: 54-....: (14998 ticks this GP) idle=2db/1/0x4000000000000000 softirq=32173387/32173387 fqs=7449
Sep 30 01:15:51 nfs-server.domain.de kernel: (t=15000 jiffies g=144775177 q=49782)
Sep 30 01:15:51 nfs-server.domain.de kernel: NMI backtrace for cpu 54
Sep 30 01:15:51 nfs-server.domain.de kernel: CPU: 54 PID: 153154 Comm: kworker/u480:36 Not tainted 5.15.0-122-generic #132-Ubuntu
Sep 30 01:15:51 nfs-server.domain.de kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
Sep 30 01:15:51 nfs-server.domain.de kernel: Workqueue: rpciod rpc_async_schedule [sunrpc]
Sep 30 01:15:51 nfs-server.domain.de kernel: Call Trace:
Sep 30 01:15:51 nfs-server.domain.de kernel: <IRQ>
Sep 30 01:15:51 nfs-server.domain.de kernel: show_stack+0x52/0x5c
Sep 30 01:15:51 nfs-server.domain.de kernel: dump_stack_lvl+0x4a/0x63
Sep 30 01:15:51 nfs-server.domain.de kernel: dump_stack+0x10/0x16
Sep 30 01:15:51 nfs-server.domain.de kernel: nmi_cpu_backtrace.cold+0x4d/0x93
Sep 30 01:15:51 nfs-server.domain.de kernel: ? lapic_can_unplug_cpu+0x90/0x90
Sep 30 01:15:51 nfs-server.domain.de kernel: nmi_trigger_cpumask_backtrace+0xec/0x100
Sep 30 01:15:51 nfs-server.domain.de kernel: arch_trigger_cpumask_backtrace+0x19/0x20
Sep 30 01:15:51 nfs-server.domain.de kernel: trigger_single_cpu_backtrace+0x44/0x4f
Sep 30 01:15:51 nfs-server.domain.de kernel: rcu_dump_cpu_stacks+0x102/0x149
Sep 30 01:15:51 nfs-server.domain.de kernel: print_cpu_stall.cold+0x2f/0xe2
Sep 30 01:15:51 nfs-server.domain.de kernel: check_cpu_stall+0x1d8/0x270
Sep 30 01:15:51 nfs-server.domain.de kernel: rcu_sched_clock_irq+0x9a/0x250
Sep 30 01:15:51 nfs-server.domain.de kernel: update_process_times+0x94/0xd0
Sep 30 01:15:51 nfs-server.domain.de kernel: tick_sched_handle+0x29/0x70
Sep 30 01:15:51 nfs-server.domain.de kernel: tick_sched_timer+0x6f/0x90
Sep 30 01:15:51 nfs-server.domain.de kernel: ? tick_sched_do_timer+0xa0/0xa0
Sep 30 01:15:51 nfs-server.domain.de kernel: __hrtimer_run_queues+0x104/0x230
Sep 30 01:15:51 nfs-server.domain.de kernel: ? read_hv_clock_tsc_cs+0x9/0x30
Sep 30 01:15:51 nfs-server.domain.de kernel: hrtimer_interrupt+0x101/0x220
Sep 30 01:15:51 nfs-server.domain.de kernel: hv_stimer0_isr+0x1d/0x30
Sep 30 01:15:51 nfs-server.domain.de kernel: __sysvec_hyperv_stimer0+0x2f/0x70
Sep 30 01:15:51 nfs-server.domain.de kernel: sysvec_hyperv_stimer0+0x7b/0x90
Sep 30 01:15:51 nfs-server.domain.de kernel: </IRQ>
Sep 30 01:15:51 nfs-server.domain.de kernel: <TASK>
Sep 30 01:15:51 nfs-server.domain.de kernel: asm_sysvec_hyperv_stimer0+0x1b/0x20
Sep 30 01:15:51 nfs-server.domain.de kernel: RIP: 0010:read_hv_clock_tsc+0x1b/0x60
Sep 30 01:15:51 nfs-server.domain.de kernel: Code: eb bc 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 35 2a 89 97 02 85 f6 74 38 4c 8b 05 27 89 97 02 48 8b 3d 28 89 97 02 0f 01 f9 <66> 90 8b 0d 0d 89 97 02 39 ce 75 d9 48 c1 e2 20 48 09 d0 49 f7 e0
Sep 30 01:15:51 nfs-server.domain.de kernel: RSP: 0018:ffffada44ab33dc8 EFLAGS: 00000202
Sep 30 01:15:51 nfs-server.domain.de kernel: RAX: 000000005d52dc50 RBX: 0002197146e8f7ec RCX: 0000000000000036
Sep 30 01:15:51 nfs-server.domain.de kernel: RDX: 00000000000571f0 RSI: 0000000000000002 RDI: 000000000000000a
Sep 30 01:15:51 nfs-server.domain.de kernel: RBP: ffffada44ab33dd0 R08: 00fca74eaf6bde68 R09: ffffffffc06265c8
Sep 30 01:15:51 nfs-server.domain.de kernel: R10: 0000000000000003 R11: ffff97e3daffe358 R12: 0000000000000000
Sep 30 01:15:51 nfs-server.domain.de kernel: R13: 000000000f685174 R14: ffff97e544039d30 R15: 0000000000000001
Sep 30 01:15:51 nfs-server.domain.de kernel: ? read_hv_clock_tsc_cs+0x9/0x30
Sep 30 01:15:51 nfs-server.domain.de kernel: ktime_get+0x43/0xc0
Sep 30 01:15:51 nfs-server.domain.de kernel: rpc_exit_task+0x95/0x110 [sunrpc]
Sep 30 01:15:51 nfs-server.domain.de kernel: ? __rpc_sleep_on_priority+0x80/0x80 [sunrpc]
Sep 30 01:15:51 nfs-server.domain.de kernel: __rpc_execute+0x65/0x270 [sunrpc]
Sep 30 01:15:51 nfs-server.domain.de kernel: rpc_async_schedule+0x30/0x50 [sunrpc]
Sep 30 01:15:51 nfs-server.domain.de kernel: process_one_work+0x228/0x3d0
Sep 30 01:15:51 nfs-server.domain.de kernel: worker_thread+0x53/0x420
Sep 30 01:15:51 nfs-server.domain.de kernel: ? process_one_work+0x3d0/0x3d0
Sep 30 01:15:51 nfs-server.domain.de kernel: kthread+0x127/0x150
Sep 30 01:15:51 nfs-server.domain.de kernel: ? set_kthread_struct+0x50/0x50
Sep 30 01:15:51 nfs-server.domain.de kernel: ret_from_fork+0x1f/0x30
Sep 30 01:15:51 nfs-server.domain.de kernel: </TASK>
Sep 30 01:17:14 nfs-server.domain.de kernel: watchdog: BUG: soft lockup - CPU#54 stuck for 134s! [kworker/u480:36:153154]
Sep 30 01:17:14 nfs-server.domain.de kernel: Modules linked in: tls rpcsec_gss_krb5 nfsv4 nfs fscache netfs binfmt_misc xfs nls_iso8859_1 intel_rapl_msr intel_rapl_common nfit serio_raw hyperv_fb rapl hv_balloon joydev mac_hid sch_fq_codel nfsd dm_multipath scsi_dh_rdac scsi_dh_emc auth_rpcgss scsi_dh_alua nfs_acl lockd grace msr efi_pstore sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hyperv_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec hid_generic rc_core hid_hyperv hv_storvsc drm scsi_transport_fc hv_netvsc hid hyperv_keyboard hv_utils crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd hv_vmbus
Sep 30 01:17:14 nfs-server.domain.de kernel: CPU: 54 PID: 153154 Comm: kworker/u480:36 Not tainted 5.15.0-122-generic #132-Ubuntu
Sep 30 01:17:14 nfs-server.domain.de kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
Sep 30 01:17:14 nfs-server.domain.de kernel: Workqueue: rpciod rpc_async_schedule [sunrpc]
Sep 30 01:17:14 nfs-server.domain.de kernel: RIP: 0010:_raw_spin_lock+0x10/0x30
Sep 30 01:17:14 nfs-server.domain.de kernel: Code: 89 e5 e8 13 63 36 ff 66 90 5d c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 05 c3 cc cc cc cc 55 89 c6 48 89 e5 e8 de 62 36 ff 66 90 5d c3
Sep 30 01:17:14 nfs-server.domain.de kernel: RSP: 0018:ffffada44ab33e20 EFLAGS: 00000246
Sep 30 01:17:14 nfs-server.domain.de kernel: RAX: 0000000000000000 RBX: ffffffffc05da910 RCX: 0000000000000001
Sep 30 01:17:14 nfs-server.domain.de kernel: RDX: 0000000000000001 RSI: ffff97e544039d00 RDI: ffffffffc0626540
Sep 30 01:17:14 nfs-server.domain.de kernel: RBP: ffffada44ab33e50 R08: 0000000000000001 R09: ffffffffc06265c8
Sep 30 01:17:14 nfs-server.domain.de kernel: R10: 0000000000000003 R11: ffff97e3daffe358 R12: ffff97e544039d00
Sep 30 01:17:14 nfs-server.domain.de kernel: R13: ffffffffc0626540 R14: ffff97e544039d30 R15: 0000000000000001
Sep 30 01:17:14 nfs-server.domain.de kernel: FS: 0000000000000000(0000) GS:ffff9818ba380000(0000) knlGS:0000000000000000
Sep 30 01:17:14 nfs-server.domain.de kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 30 01:17:14 nfs-server.domain.de kernel: CR2: 00007fd9ae6b0240 CR3: 00000001086fc003 CR4: 00000000003706e0
Sep 30 01:17:14 nfs-server.domain.de kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 30 01:17:14 nfs-server.domain.de kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 30 01:17:14 nfs-server.domain.de kernel: Call Trace:
Sep 30 01:17:14 nfs-server.domain.de kernel: <IRQ>
Sep 30 01:17:14 nfs-server.domain.de kernel: ? show_trace_log_lvl+0x1d6/0x2ea
Sep 30 01:17:14 nfs-server.domain.de kernel: ? show_trace_log_lvl+0x1d6/0x2ea
Sep 30 01:17:14 nfs-server.domain.de kernel: ? rpc_async_schedule+0x30/0x50 [sunrpc]
Sep 30 01:17:14 nfs-server.domain.de kernel: ? show_regs.part.0+0x23/0x29
Sep 30 01:17:14 nfs-server.domain.de kernel: ? show_regs.cold+0x8/0xd
Sep 30 01:17:14 nfs-server.domain.de kernel: ? watchdog_timer_fn+0x1be/0x220
Sep 30 01:17:14 nfs-server.domain.de kernel: ? lockup_detector_update_enable+0x60/0x60
Sep 30 01:17:14 nfs-server.domain.de kernel: ? __hrtimer_run_queues+0x104/0x230
Sep 30 01:17:14 nfs-server.domain.de kernel: ? read_hv_clock_tsc_cs+0x9/0x30
Sep 30 01:17:14 nfs-server.domain.de kernel: ? hrtimer_interrupt+0x101/0x220
Sep 30 01:17:14 nfs-server.domain.de kernel: ? hv_stimer0_isr+0x1d/0x30
Sep 30 01:17:14 nfs-server.domain.de kernel: ? __sysvec_hyperv_stimer0+0x2f/0x70
Sep 30 01:17:14 nfs-server.domain.de kernel: ? sysvec_hyperv_stimer0+0x7b/0x90
Sep 30 01:17:14 nfs-server.domain.de kernel: </IRQ>
Sep 30 01:17:14 nfs-server.domain.de kernel: <TASK>
Sep 30 01:17:14 nfs-server.domain.de kernel: ? asm_sysvec_hyperv_stimer0+0x1b/0x20
Sep 30 01:17:14 nfs-server.domain.de kernel: ? __rpc_sleep_on_priority+0x80/0x80 [sunrpc]
Sep 30 01:17:14 nfs-server.domain.de kernel: ? _raw_spin_lock+0x10/0x30
Sep 30 01:17:14 nfs-server.domain.de kernel: ? __rpc_execute+0x8b/0x270 [sunrpc]
Sep 30 01:17:14 nfs-server.domain.de kernel: rpc_async_schedule+0x30/0x50 [sunrpc]
Sep 30 01:17:14 nfs-server.domain.de kernel: process_one_work+0x228/0x3d0
Sep 30 01:17:14 nfs-server.domain.de kernel: worker_thread+0x53/0x420
Sep 30 01:17:14 nfs-server.domain.de kernel: ? process_one_work+0x3d0/0x3d0
Sep 30 01:17:14 nfs-server.domain.de kernel: kthread+0x127/0x150
Sep 30 01:17:14 nfs-server.domain.de kernel: ? set_kthread_struct+0x50/0x50
Sep 30 01:17:14 nfs-server.domain.de kernel: ret_from_fork+0x1f/0x30
Sep 30 01:17:14 nfs-server.domain.de kernel: </TASK>

There seem to be more problems with the NFS backchannel at the moment:
https://lore.kernel.org/linux-nfs/?q=backchannel
https://access.redhat.com/solutions/7000130

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2062568

Title:
  nfsd gets unresponsive after some hours of operation

Status in linux package in Ubuntu:
  In Progress
Status in nfs-utils package in Ubuntu:
  Confirmed

Bug description:
  I installed the 24.04 Beta on two test machines that were running 22.04 without issues before. One of them exports two volumes that are mounted by the other machine, which primarily uses them as a secondary storage for ccache.

  After being up for a couple of hours (happened twice since yesterday evening) it seems that nfsd on the machine exporting the volumes hangs on something.

  From dmesg on the server (repeated a few times):

  [11183.290548] INFO: task nfsd:1419 blocked for more than 1228 seconds.
  [11183.290558] Not tainted 6.8.0-22-generic #22-Ubuntu
  [11183.290563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [11183.290582] task:nfsd state:D stack:0 pid:1419 tgid:1419 ppid:2 flags:0x00004000
  [11183.290587] Call Trace:
  [11183.290602] <TASK>
  [11183.290606] __schedule+0x27c/0x6b0
  [11183.290612] schedule+0x33/0x110
  [11183.290615] schedule_timeout+0x157/0x170
  [11183.290619] wait_for_completion+0x88/0x150
  [11183.290623] __flush_workqueue+0x140/0x3e0
  [11183.290629] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
  [11183.290689] nfsd4_destroy_session+0x186/0x260 [nfsd]
  [11183.290744] nfsd4_proc_compound+0x3af/0x770 [nfsd]
  [11183.290798] nfsd_dispatch+0xd4/0x220 [nfsd]
  [11183.290851] svc_process_common+0x44d/0x710 [sunrpc]
  [11183.290924] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
  [11183.290976] svc_process+0x132/0x1b0 [sunrpc]
  [11183.291041] svc_handle_xprt+0x4d3/0x5d0 [sunrpc]
  [11183.291105] svc_recv+0x18b/0x2e0 [sunrpc]
  [11183.291168] ? __pfx_nfsd+0x10/0x10 [nfsd]
  [11183.291220] nfsd+0x8b/0xe0 [nfsd]
  [11183.291270] kthread+0xef/0x120
  [11183.291274] ? __pfx_kthread+0x10/0x10
  [11183.291276] ret_from_fork+0x44/0x70
  [11183.291279] ? __pfx_kthread+0x10/0x10
  [11183.291281] ret_from_fork_asm+0x1b/0x30
  [11183.291286] </TASK>

  From dmesg on the client (repeated a number of times):
  [ 6596.911785] RPC: Could not send backchannel reply error: -110
  [ 6596.972490] RPC: Could not send backchannel reply error: -110
  [ 6837.281307] RPC: Could not send backchannel reply error: -110

  ProblemType: Bug
  DistroRelease: Ubuntu 24.04
  Package: nfs-kernel-server 1:2.6.4-3ubuntu5
  ProcVersionSignature: Ubuntu 6.8.0-22.22-generic 6.8.1
  Uname: Linux 6.8.0-22-generic x86_64
  .etc.request-key.d.id_resolver.conf: create id_resolver * * /usr/sbin/nfsidmap -t 600 %k %d
  ApportVersion: 2.28.1-0ubuntu1
  Architecture: amd64
  CasperMD5CheckResult: pass
  Date: Fri Apr 19 14:10:25 2024
  InstallationDate: Installed on 2024-04-16 (3 days ago)
  InstallationMedia: Ubuntu-Server 24.04 LTS "Noble Numbat" - Beta amd64 (20240410.1)
  NFSMounts:
  NFSv4Mounts:
  ProcEnviron:
   LANG=en_US.UTF-8
   PATH=(custom, no user)
   SHELL=/bin/bash
   TERM=xterm-256color
   XDG_RUNTIME_DIR=<set>
  SourcePackage: nfs-utils
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2062568/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp