This bug is awaiting verification that the linux-aws/6.5.0-1015.15 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-mantic-linux-aws' to 'verification-done-mantic-linux-aws'. If the problem still exists, change the tag 'verification-needed-mantic-linux-aws' to 'verification-failed-mantic-linux-aws'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-mantic-linux-aws-v2 verification-needed-mantic-linux-aws

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress

Bug description:
  [Impact]

  * The issue causes a transmit hang on E810 ports with bonding enabled.
  * Based on the provided logs, the TX hang can last as long as a couple of minutes, but in most scenarios the network recovers after the ice driver performs a PF reset (TX hang handler routine).
  * Originally, the issue was observed during Tempest tests on a newly created OpenStack cluster, resulting in a lack of certification.

  [Fix]

  * Initially, a workaround was proposed by Intel engineers to disable LAG initialization [1]. This change has been tested in an environment where reproduction is easily achieved. After multiple iterations, no reproduction has been observed.
  * Shortly after, Intel proposed a patch [2] to disable LAG initialization if the NVM does not expose proper capabilities.

  [Test Plan]

  * To reproduce the issue, a cluster of over 20 nodes with Ceph-based storage was used. The problem could sometimes manifest while deploying a cluster, or after the cluster was already deployed, during the Tempest test run.
  * The issue could appear on a random node, making reproduction hard to achieve.
  * Multiple stress tests on a single host with a similar configuration did not trigger a reproduction.
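
  Because the hang can show up on any node, it may help to scan each node's kernel log for the watchdog message when checking a test run. The following is only a minimal sketch, not part of the fix or of any Ubuntu tooling: it assumes the message format quoted in the original description ("NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"), and the names TxTimeout, parse_watchdog_line and find_ice_timeouts are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Pattern modelled on the message quoted in the original description:
#   NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
# Assumption: other kernels may append extra text after "timed out",
# so the pattern is deliberately not anchored at the end of the line.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


class TxTimeout(NamedTuple):
    """One transmit-queue timeout reported by the netdev watchdog."""
    iface: str
    driver: str
    queue: int


def parse_watchdog_line(line: str) -> Optional[TxTimeout]:
    """Return the parsed timeout, or None if the line is not a watchdog message."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    return TxTimeout(match["iface"], match["driver"], int(match["queue"]))


def find_ice_timeouts(lines: Iterable[str]) -> List[TxTimeout]:
    """Collect watchdog timeouts reported for interfaces bound to the ice driver."""
    hits = []
    for line in lines:
        hit = parse_watchdog_line(line)
        if hit is not None and hit.driver == "ice":
            hits.append(hit)
    return hits
```

  The lines passed to find_ice_timeouts could come from saved kernel log output collected from each node; an empty result after a full Tempest run would be consistent with, though not proof of, the fix being effective.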

  [Where problems could occur]

  * All ice drivers with ice_lag_event_handler registered can expose the issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG capabilities (CVL4.3 wasn't checked), meaning that at some point an NVM with this capability will be released. Although the issue is potentially caused by using features without proper FW support [2], we want to take a closer look once NVMs with proper support are introduced.

  [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
        4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]

  * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the ice driver backported from the mainline kernel from before patch [2] was added.
  * Original description of the case below:

  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
  - using a bond over the two ports of the same card, at 25Gbps to two different switches; the bond is using LACP with hash layer3+4 and fast timeout. But I believe the bug is not directly related to bonding, as the problem seems to be in the interface.
  - machine installed by maas.
    No issues during installation, but at that time the bond is not formed yet; later, when Linux is booted, the bond is formed and works without issues for a while
  - it works fine for about 2 to 3 hours, then the issue starts (may or may not be related to network load, but it seems that it is triggered by some tests that I run after openstack finishes installing)
  - one of the legs of the bond freezes and everything that would go to that leg is discarded, in and out; pings to random external hosts start losing every second packet
  - after some time you can see kernel log messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace
  - the switch does log that the bond is flapping
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw---- 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release amd64 (20230810)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7515
  Package: linux (not installed)
  PciMultimedia:

  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
  ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-83-generic N/A
   linux-backports-modules-5.15.0-83-generic  N/A
   linux-firmware                             20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags:  jammy uec-images
  Uname: Linux 5.15.0-83-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/27/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 0J91V2
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7515
  dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
  dmi.sys.vendor: Dell Inc.
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Sep 15 03:13 seq
   crw-rw---- 1 root audio 116, 33 Sep 15 03:13 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse:
   Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied
   Cannot stat file /proc/323635/fd/10: Permission denied
  CRDA: N/A
  CasperMD5CheckResult: unknown
  CloudArchitecture: x86_64
  CloudID: maas
  CloudName: maas
  CloudPlatform: maas
  CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/)
  DistroRelease: Ubuntu 22.04
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7525
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  Package: linux (not installed)
  PciMultimedia:

  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.2.0-32-generic root=UUID=9b437790-e6e2-4a2e-af79-5b13fee932af ro
  ProcVersionSignature: Ubuntu 6.2.0-32.32~22.04.1-generic 6.2.16
  RebootRequiredPkgs: Error: path contained symlinks.
  RelatedPackageVersions:
   linux-restricted-modules-6.2.0-32-generic N/A
   linux-backports-modules-6.2.0-32-generic  N/A
   linux-firmware                            20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags:  jammy uec-images
  Uname: Linux 6.2.0-32-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/26/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 03WYW4
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A02
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/26/2023:br2.12:svnDellInc.:pnPowerEdgeR7525:pvr:rvnDellInc.:rn03WYW4:rvrA02:cvnDellInc.:ct23:cvr:skuSKU=08FF;ModelName=PowerEdgeR7525:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7525
  dmi.product.sku: SKU=08FF;ModelName=PowerEdge R7525
  dmi.sys.vendor: Dell Inc.
  mtime.conffile..etc.logrotate.d.apport: 2023-09-15T13:17:01.203771

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp