@Robert, first of all, thanks a lot for pursuing this issue!

1) I can certainly provide the debugging info. May I ask if ...

 a) the system in question would need to have an active LAG (LACP) for this to be helpful? We switched to active-backup on all our machines because of this very issue.
 b) this requires FW version 4.20? All our machines already run 4.30.
 c) this requires a certain kernel / ice driver version?

2) FW 4.40 is now out [1]. There seem to be no fixes related to LAG / LACP, though there are some regarding SR-IOV. But I guess you are convinced the issue is not within the FW, but rather in the ice driver?

[1] - https://www.intel.com/content/www/us/en/download/19624/non-volatile-memory-nvm-update-utility-for-intel-ethernet-network-adapter-e810-series.html

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  [Impact]

  * The issue causes a transmit hang on E810 ports with bonding enabled.
  * Based on the provided logs, the TX hang can last for as long as a couple of minutes, but in most scenarios the network recovers after the ice driver performs a PF reset (TX hang handler routine).
  * Originally, the issue was observed during Tempest tests on a newly created OpenStack cluster, resulting in a lack of certification.

  [Fix]

  * Initially, Intel engineers proposed a workaround that disables LAG initialization [1]. This change has been tested in an environment where reproduction is easily achieved. After multiple iterations, no reproduction has been observed.
  * Shortly after, Intel proposed a patch [2] to disable LAG initialization if the NVM does not expose the proper capabilities.

  [Test Plan]

  * To reproduce the issue, a cluster of more than 20 nodes with Ceph-based storage was used. The problem could sometimes manifest while deploying the cluster or, after the cluster was already deployed, during the Tempest test run.
  * The issue could appear on a random node, making reproduction hard to achieve.
  * Multiple stress tests on a single host with a similar configuration did not trigger a reproduction.

  [Where problems could occur]

  * All ice drivers with ice_lag_event_handler registered can expose the issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG capabilities (CVL4.3 wasn't checked), meaning that at some point an NVM with this capability will be released. Although the issue is potentially caused by using features without proper FW support [2], we want to take a closer look once NVMs with proper support are introduced.

  [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
  4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]

  * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the ice driver backported from the mainline kernel from before patch [2] was added.
  * Original description of the case below:

  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
  - using a bond over the two ports of the same card, at 25Gbps to two different switches; the bond uses LACP with hash layer3+4 and fast timeout. But I believe the bug is not directly related to bonding, as the problem seems to be in the interface.
  - machine installed by maas. No issues during installation, but at that time the bond is not formed yet; later, when Linux is booted, the bond is formed and works without issues for a while
  - it works fine for about 2 to 3 hours, then the issue starts (may or may not be related to network load, but it seems that it is triggered by some tests that I run after openstack finishes installing)
  - one of the legs of the bond freezes and everything that would go to that lag is discarded, in and out; pings to random external hosts start losing every second packet
  - after some time you can see kernel log messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace
  - the switch does log that the bond is flapping

  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw---- 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release amd64 (20230810)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7515
  Package: linux (not installed)
  PciMultimedia:

  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
  ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-83-generic N/A
   linux-backports-modules-5.15.0-83-generic  N/A
   linux-firmware                             20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags: jammy uec-images
  Uname: Linux 5.15.0-83-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/27/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 0J91V2
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7515
  dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
  dmi.sys.vendor: Dell Inc.
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Sep 15 03:13 seq
   crw-rw---- 1 root audio 116, 33 Sep 15 03:13 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse:
   Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
   Cannot stat file /proc/215602/fd/10: Permission denied
   Cannot stat file /proc/323635/fd/10: Permission denied
  CRDA: N/A
  CasperMD5CheckResult: unknown
  CloudArchitecture: x86_64
  CloudID: maas
  CloudName: maas
  CloudPlatform: maas
  CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/)
  DistroRelease: Ubuntu 22.04
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7525
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  Package: linux (not installed)
  PciMultimedia:

  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.2.0-32-generic root=UUID=9b437790-e6e2-4a2e-af79-5b13fee932af ro
  ProcVersionSignature: Ubuntu 6.2.0-32.32~22.04.1-generic 6.2.16
  RebootRequiredPkgs: Error: path contained symlinks.
  RelatedPackageVersions:
   linux-restricted-modules-6.2.0-32-generic N/A
   linux-backports-modules-6.2.0-32-generic  N/A
   linux-firmware                            20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags: jammy uec-images
  Uname: Linux 6.2.0-32-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/26/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 03WYW4
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A02
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/26/2023:br2.12:svnDellInc.:pnPowerEdgeR7525:pvr:rvnDellInc.:rn03WYW4:rvrA02:cvnDellInc.:ct23:cvr:skuSKU=08FF;ModelName=PowerEdgeR7525:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7525
  dmi.product.sku: SKU=08FF;ModelName=PowerEdge R7525
  dmi.sys.vendor: Dell Inc.
  mtime.conffile..etc.logrotate.d.apport: 2023-09-15T13:17:01.203771

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp