Hey @Christian,

1a) No need. AQ 0x000A returns the NVM capabilities regardless of the configuration applied (the query is done during driver init).

1b) That's the point. I noticed you upgraded to 4.3, which I currently don't have access to, and I wanted to verify the capabilities on 4.3. NVM caps should be similar for the same NVM version within a single head of family, so the values I have access to would be the same as the ones you had on 4.2 (meaning there is no point in collecting these).

1c) No, any "recent" kernel/driver version supports enabling debug logs by adding the dyndbg=+p param to the module. We only care about the logs which are retrieved from NVM and printed with debug flags.
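For reference, collecting the logs mentioned in 1c could look roughly like the sketch below. It is only an illustration under a few assumptions: the helper names are made up for this example, the search string "sriov_lag" is a guess at how the capability appears in the debug output, and the driver must not already be loaded when modprobe runs (unloading ice first would drop the network on that host). It needs root.

```python
import subprocess


def modprobe_debug_cmd(module: str = "ice") -> list[str]:
    """Command that loads `module` with its dynamic-debug callsites enabled."""
    return ["modprobe", module, "dyndbg=+p"]


def filter_log(log_text: str, needle: str) -> list[str]:
    """Keep only the kernel log lines that contain `needle`."""
    return [line for line in log_text.splitlines() if needle in line]


def collect_caps(needle: str = "sriov_lag") -> list[str]:
    """Load the driver with debug output on, then search dmesg for `needle`.

    Assumption: the module is not loaded yet, so the capability query done
    during driver init is printed after the debug flag is already set.
    """
    subprocess.run(modprobe_debug_cmd(), check=True)
    dmesg = subprocess.run(["dmesg"], check=True, capture_output=True, text=True)
    return filter_log(dmesg.stdout, needle)
```

Calling collect_caps() on an affected host and attaching the returned lines should be enough; an empty list only means nothing matched the assumed search string, not that the capability is absent.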
2) Based on recent patches from Intel, the issue is caused by performing LAG-related operations without proper support from the NVM. Release notes do not always list every feature change, so there is a possibility that 4.4 introduced the sriov_lag capability, but I cannot verify it.

Worst case scenario: NVM 4.4 introduces the sriov_lag capability, meaning the patches recently added to the upstream kernel have no effect and the issue still reproduces. In this scenario there is currently no 'workaround' for it.

Best case scenario: NVM 4.4 introduces the sriov_lag capability and the issue no longer reproduces. In this scenario no additional patches to the driver are required.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  [Impact]

  * The issue causes a transmit hang on E810 ports with bonding enabled.
  * Based on the provided logs, the TX hang can last for as long as a couple of minutes, but in most scenarios the network recovers after the ice driver performs a PF reset (TX hang handler routine).
  * Originally, the issue was observed during Tempest tests on a newly created OpenStack cluster, resulting in a lack of certification.

  [Fix]

  * Initially, a workaround was proposed by Intel engineers to disable LAG initialization [1]. This change has been tested in an environment where reproduction is easily achieved. After multiple iterations, no reproduction has been observed.
  * Shortly after, Intel proposed a patch [2] to disable LAG initialization if the NVM does not expose the proper capabilities.

  [Test Plan]

  * To reproduce the issue, a cluster of over 20 nodes with Ceph-based storage was used.
    The problem could sometimes manifest while deploying a cluster, or during the Tempest test run after the cluster was already deployed.
  * The issue could appear on a random node, making reproduction hard to achieve.
  * Multiple stress tests on a single host with a similar configuration did not trigger a reproduction.

  [Where problems could occur]

  * All ice drivers with ice_lag_event_handler registered can expose the issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG capabilities (CVL4.3 wasn't checked), meaning that at some point an NVM with this capability will be released. Although the issue is potentially caused by using features without proper FW support [2], we want to take a closer look once NVMs with proper support are introduced.

  [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
        4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]

  * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the ice driver backported from the mainline kernel from before patch [2] was added.
  * Original description of the case below:

  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
  - using a bond over the two ports of the same card, at 25Gbps to two different switches; the bond is using LACP with hash layer3+4 and fast timeout. But I believe the bug is not directly related to bonding as the problem seems to be in the interface.
  - machine installed by maas.
    No issues during installation, but at that time the bond is not formed yet; later, when Linux is booted, the bond is formed and works without issues for a while
  - it works fine for about 2 to 3 hours, then the issue starts (may or may not be related to network load, but it seems to be triggered by some tests that I run after OpenStack finishes installing)
  - one of the legs of the bond freezes and everything that would go to that leg is discarded, in and out; pings to random external hosts start losing every second packet
  - after some time you can see kernel log messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace
  - the switch does log that the bond is flapping
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw---- 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release amd64 (20230810)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc.
   PowerEdge R7515
  Package: linux (not installed)
  PciMultimedia:
  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
  ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-83-generic N/A
   linux-backports-modules-5.15.0-83-generic  N/A
   linux-firmware                             20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags: jammy uec-images
  Uname: Linux 5.15.0-83-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/27/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 0J91V2
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7515
  dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
  dmi.sys.vendor: Dell Inc.
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116,  1 Sep 15 03:13 seq
   crw-rw---- 1 root audio 116, 33 Sep 15 03:13 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse:
   Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
   Cannot stat file /proc/215602/fd/10: Permission denied
   Cannot stat file /proc/323635/fd/10: Permission denied
  CRDA: N/A
  CasperMD5CheckResult: unknown
  CloudArchitecture: x86_64
  CloudID: maas
  CloudName: maas
  CloudPlatform: maas
  CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/)
  DistroRelease: Ubuntu 22.04
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7525
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  Package: linux (not installed)
  PciMultimedia:
  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.2.0-32-generic root=UUID=9b437790-e6e2-4a2e-af79-5b13fee932af ro
  ProcVersionSignature: Ubuntu 6.2.0-32.32~22.04.1-generic 6.2.16
  RebootRequiredPkgs: Error: path contained symlinks.
  RelatedPackageVersions:
   linux-restricted-modules-6.2.0-32-generic N/A
   linux-backports-modules-6.2.0-32-generic  N/A
   linux-firmware                            20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags: jammy uec-images
  Uname: Linux 6.2.0-32-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/26/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 03WYW4
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A02
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/26/2023:br2.12:svnDellInc.:pnPowerEdgeR7525:pvr:rvnDellInc.:rn03WYW4:rvrA02:cvnDellInc.:ct23:cvr:skuSKU=08FF;ModelName=PowerEdgeR7525:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7525
  dmi.product.sku: SKU=08FF;ModelName=PowerEdge R7525
  dmi.sys.vendor: Dell Inc.
  mtime.conffile..etc.logrotate.d.apport: 2023-09-15T13:17:01.203771

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp