Switched to the HWE kernel on jammy (6.2.0-32-generic #32~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 18 10:40:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux) and I still see essentially the same issue:

[33219.508873] ------------[ cut here ]------------
[33219.508877] NETDEV WATCHDOG: enp161s0f1 (ice): transmit queue 35 timed out
[33219.508932] WARNING: CPU: 48 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x21f/0x230
[33219.508940] Modules linked in: sch_ingress nf_conntrack_netlink geneve ip6_udp_tunnel udp_tunnel xt_CT dm_crypt scsi_transport_iscsi veth nfnetlink_cttimeout openvswitch nsh nf_conncount unix_diag nft_masq zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge sunrpc nvme_fabrics 8021q garp mrp stp llc bonding tls binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd dell_wmi kvm_amd video ledtrig_audio nls_iso8859_1 irdma sparse_keymap kvm i40e irqbypass dell_smbios dcdbas ib_uverbs rapl dell_wmi_descriptor wmi_bmof ib_core ccp ptdma k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ramoops
[33219.509051] reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cdc_ether usbnet mii mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea crct10dif_pclmul sysfillrect sysimgblt crc32_pclmul bcache polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 nvme aesni_intel crypto_simd nvme_core ahci xhci_pci cryptd ice tg3 libahci drm megaraid_sas i2c_piix4 xhci_pci_renesas nvme_common wmi
[33219.509114] CPU: 48 PID: 0 Comm: swapper/48 Tainted: P O 6.2.0-32-generic #32~22.04.1-Ubuntu
[33219.509116] Hardware name: Dell Inc. PowerEdge R7525/03WYW4, BIOS 2.12.4 07/26/2023
[33219.509118] RIP: 0010:dev_watchdog+0x21f/0x230
[33219.509122] Code: 00 e9 31 ff ff ff 4c 89 e7 c6 05 66 83 78 01 01 e8 56 00 f8 ff 44 89 f1 4c 89 e6 48 c7 c7 08 4f e4 b7 48 89 c2 e8 61 df 2b ff <0f> 0b e9 22 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[33219.509123] RSP: 0018:ffffb42719fd0e70 EFLAGS: 00010246
[33219.509125] RAX: 0000000000000000 RBX: ffff9bd91b3e74c8 RCX: 0000000000000000
[33219.509126] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[33219.509127] RBP: ffffb42719fd0e98 R08: 0000000000000000 R09: 0000000000000000
[33219.509128] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9bd91b3e7000
[33219.509129] R13: ffff9bd91b3e741c R14: 0000000000000023 R15: 0000000000000000
[33219.509130] FS: 0000000000000000(0000) GS:ffff9b573de00000(0000) knlGS:0000000000000000
[33219.509132] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[33219.509133] CR2: 000055fd64034000 CR3: 0000010273ae2004 CR4: 0000000000770ee0
[33219.509135] PKRU: 55555554
[33219.509135] Call Trace:
[33219.509137] <IRQ>
[33219.509140] ? show_regs+0x72/0x90
[33219.509145] ? dev_watchdog+0x21f/0x230
[33219.509147] ? __warn+0x8d/0x160
[33219.509151] ? dev_watchdog+0x21f/0x230
[33219.509154] ? report_bug+0x1bb/0x1d0
[33219.509158] ? handle_bug+0x46/0x90
[33219.509162] ? exc_invalid_op+0x19/0x80
[33219.509165] ? asm_exc_invalid_op+0x1b/0x20
[33219.509171] ? dev_watchdog+0x21f/0x230
[33219.509174] ? __pfx_dev_watchdog+0x10/0x10
[33219.509176] call_timer_fn+0x2c/0x160
[33219.509180] ? __pfx_dev_watchdog+0x10/0x10
[33219.509182] __run_timers.part.0+0x1fb/0x2b0
[33219.509185] ? ktime_get+0x46/0xc0
[33219.509187] ? __pfx_tick_sched_timer+0x10/0x10
[33219.509191] ? native_apic_msr_write+0x46/0x70
[33219.509194] ? lapic_next_event+0x20/0x30
[33219.509197] ? clockevents_program_event+0xb5/0x140
[33219.509200] run_timer_softirq+0x2a/0x60
[33219.509202] __do_softirq+0xdd/0x330
[33219.509205] ? hrtimer_interrupt+0x12b/0x250
[33219.509208] __irq_exit_rcu+0xa2/0xd0
[33219.509210] irq_exit_rcu+0xe/0x20
[33219.509212] sysvec_apic_timer_interrupt+0x96/0xb0
[33219.509215] </IRQ>
[33219.509216] <TASK>
[33219.509216] asm_sysvec_apic_timer_interrupt+0x1b/0x20
[33219.509219] RIP: 0010:mwait_idle+0x55/0x90
[33219.509222] Code: 31 d2 48 89 d1 65 48 8b 04 25 40 18 03 00 0f 01 c8 48 8b 00 a8 08 75 14 eb 07 0f 00 2d 24 d2 35 00 31 c0 48 89 c1 fb 0f 01 c9 <eb> 06 fb 0f 1f 44 00 00 65 48 8b 04 25 40 18 03 00 f0 80 60 02 df
[33219.509224] RSP: 0018:ffffb42700587e80 EFLAGS: 00000246
[33219.509225] RAX: 0000000000000000 RBX: ffff9ad9ccd999c0 RCX: 0000000000000000
[33219.509226] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[33219.509227] RBP: ffffb42700587e80 R08: 0000000000000000 R09: 0000000000000000
[33219.509229] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[33219.509230] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[33219.509232] arch_cpu_idle+0x15/0x20
[33219.509235] default_idle_call+0x4a/0x120
[33219.509237] cpuidle_idle_call+0x185/0x1e0
[33219.509241] do_idle+0x82/0x110
[33219.509243] cpu_startup_entry+0x20/0x30
[33219.509245] start_secondary+0x122/0x160
[33219.509248] secondary_startup_64_no_verify+0xe5/0xeb
[33219.509253] </TASK>
[33219.509254] ---[ end trace 0000000000000000 ]---
[33220.417178] ice 0000:a1:00.1 enp161s0f1: tx_timeout: VSI_num: 8, Q 35, NTC: 0x42, HW_HEAD: 0x41, NTU: 0x42, INT: 0x0
[33220.417186] ice 0000:a1:00.1 enp161s0f1: tx_timeout recovery level 1, txqueue 35
[33223.905010] bond0: (slave enp161s0f1): link status definitely down, disabling slave
[33223.905018] bond0: active interface up!
[33224.344729] ice 0000:a1:00.1: PTP reset successful
[33655.093659] ice 0000:a1:00.1: VSI rebuilt. VSI index 0, type ICE_VSI_PF
[33655.104975] ice 0000:a1:00.1: VSI rebuilt. VSI index 383, type ICE_VSI_CTRL
[33655.217315] bond0: (slave enp161s0f1): link status definitely up, 25000 Mbps full duplex
[33666.895550] ice 0000:a1:00.1 enp161s0f1: tx_timeout: VSI_num: 8, Q 92, NTC: 0x17, HW_HEAD: 0x25, NTU: 0x26, INT: 0x0
[33666.895557] ice 0000:a1:00.1 enp161s0f1: tx_timeout recovery level 1, txqueue 92
[33670.816422] bond0: (slave enp161s0f1): link status definitely down, disabling slave
[33671.261841] ice 0000:a1:00.1: PTP reset successful
[33961.392293] ice 0000:a1:00.1: VSI rebuilt. VSI index 0, type ICE_VSI_PF
[33961.410920] ice 0000:a1:00.1: VSI rebuilt. VSI index 383, type ICE_VSI_CTRL
[33961.476136] bond0: (slave enp161s0f1): link status definitely up, 25000 Mbps full duplex

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2004262

Title:
  Intel E810 NICs driver in causing hangs when booting and bonds configured

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Kinetic:
  Fix Released
Status in linux source package in Lunar:
  Confirmed

Bug description:
  [Impact]
   * Intel E810-family NICs cause system hangs when booting with bonding enabled
   * This happens due to the driver unplugging auxiliary devices
   * The unplug event happens under RTNL lock context, which causes a deadlock where the RDMA driver waits for the RTNL lock to complete removal

  [Test Plan]
   * Users have reported that after setting up bonding on switch and server side, the system will hang when starting network services

  [Fix]
   * The upstream patch defers unplugging/re-plugging of the auxiliary device, so that it's not performed under the RTNL lock context.
   * Fix was introduced by commit:
     248401cb2c46 ice: avoid bonding causing auxiliary plug/unplug under RTNL lock

  [Regression Potential]
   * Regressions would manifest in devices that support RDMA functionality and have been added to a bond
   * We should look out for auxiliary devices that haven't been properly unplugged, or that cause further issues with ice_plug_aux_dev()/ice_unplug_aux_dev()

  [Original Description]
  jammy 22.04.1
  linux-image-generic 5.15.0-58-generic

  Intel E810-XXV Dual Port NICs in Dell PowerEdge R650

  - 5.15 in jammy -> reproducible
  - 5.19 in hwe-edge -> reproducible
  - 6.2.rc6 in the mainline build -> works
  - Intel's ice driver 1.10.1.2.2 -> works

  After bonding is enabled on switch and server side, the system will hang while initializing Ubuntu. The kernel loads, but around the time Network Services start the system can hang, sometimes for 5 minutes and in other cases indefinitely. The messages "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" and "systemd-resolve blocked for more than 120 seconds" appear, and eventually Network Services just keeps attempting to start and never does. This is with or without DHCP enabled.

  Tried this same setup with the hwe-22.04, hwe-20.04, hwe-22.04-edge and linux-oem kernels and all exhibit the same failure.

  To work around this, installing the Intel 'ice' driver version 1.10.1.2.2 works. The system doesn't even remotely hang at startup and all networking functions remain working (ping, DNS, general accessibility).
  The driver can be found at https://downloadmirror.intel.com/763930/ice-1.10.1.2.2.tar.gz

  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116, 1 Jan 31 13:08 seq
   crw-rw---- 1 root audio 116, 33 Jan 31 13:08 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.3
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5json: { "result": "skip" }
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-01-27 (3 days ago)
  InstallationMedia: Ubuntu-Server 22.04.1 LTS "Jammy Jellyfish" - Release amd64 (20220809)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R650
  Package: linux (not installed)
  PciMultimedia:
  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-58-generic root=UUID=668aab7c-abe9-434b-a810-acc6eab76cbc ro fsck.mode=skip
  ProcVersionSignature: Ubuntu 5.15.0-58.64-generic 5.15.74
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-58-generic N/A
   linux-backports-modules-5.15.0-58-generic N/A
   linux-firmware 20220329.git681281e4-0ubuntu3.9
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags: jammy uec-images
  Uname: Linux 5.15.0-58-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 09/14/2022
  dmi.bios.release: 1.8
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 1.8.2
  dmi.board.name: 0PJ7YJ
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr1.8.2:bd09/14/2022:br1.8:svnDellInc.:pnPowerEdgeR650:pvr:rvnDellInc.:rn0PJ7YJ:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=0912;ModelName=PowerEdgeR650:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R650
  dmi.product.sku: SKU=0912;ModelName=PowerEdge R650
  dmi.sys.vendor: Dell Inc.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2004262/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp