** Tags added: kernel-daily-bug
--
https://bugs.launchpad.net/bugs/2036239
Title:
Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Status in linux package in Ubuntu:
In Progress
Status in linux source package in Jammy:
Fix Released
Status in linux source package in Mantic:
Fix Released
Status in linux source package in Noble:
In Progress
Bug description:
[Impact]
* The issue causes transmit hangs on E810 ports with bonding enabled.
* Based on the provided logs, a TX hang can last as long as a couple of
minutes, but in most scenarios the network recovers after the ice
driver performs a PF reset (TX hang handler routine).
* Originally, the issue was observed during Tempest tests on a newly
created OpenStack cluster, resulting in a lack of certification.
[Fix]
* Initially, Intel engineers proposed a workaround that disables
LAG initialization [1].
This change was tested in an environment where the issue is
easily reproduced.
After multiple iterations, no reproduction was observed.
* Shortly after, Intel proposed a patch [2] that disables LAG initialization
if the NVM does not expose the proper capabilities.
[Test Plan]
* To reproduce the issue, a cluster of more than 20 nodes with Ceph-based
storage was used. The problem could manifest either while deploying the
cluster or, once the cluster was deployed, during the Tempest test run.
* The issue could appear on a random node, making reproduction hard to
achieve.
* Multiple stress tests on a single host with a similar configuration did not
trigger a reproduction.
[Where problems could occur]
* All ice drivers with ice_lag_event_handler registered can expose the
issue. This handler is not implemented in 20.04.
* CVL4.2 and older NVM images for E810 do not expose SRIOV LAG
capabilities (CVL4.3 wasn't checked), meaning that at some point an NVM with
this capability will be released.
Although the issue is potentially caused by using features without proper FW
support [2], we want to take a closer look once NVMs with proper support are
introduced.
[1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
[2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
4d50fcdc2476eef94c14c6761073af5667bb43b6
[Other Info]
* The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the ice
driver backported from a mainline kernel that predates patch [2].
* Original description of the case below:
I'm having issues with an Intel E810-XXV card on a Dell server under
Ubuntu Jammy.
Details:
- hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
Controller E810-XXV for SFP (rev 02)
- tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
`6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
- using a bond over the two ports of the same card, at 25Gbps to two
different switches, bond is using LACP with hash layer3+4 and fast
timeout. But I believe the bug is not directly related to bonding as
the problem seems to be in the interface.
- machine installed by MAAS. No issues during installation, but at
that time the bond is not formed yet; later, when Linux is booted, the bond
is formed and works without issues for a while
- it works for about 2 to 3 hours fine, then the issue starts (may or
may not be related to network load, but it seems that it is triggered
by some tests that I run after openstack finishes installing)
- one of the legs of the bond freezes and everything that would go to
that leg is discarded, in and out; pings to random external hosts start
losing every second packet
- after some time you can see on the kernel log messages about "NETDEV
WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
trace
- the switch does log that the bond is flapping
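
To spot affected nodes in a larger cluster, the kernel log can be scanned
for the watchdog message quoted above. The following is only a minimal
sketch: the regular expression is derived from the single message shown in
this report, and the names WATCHDOG_RE, TxTimeout and find_tx_timeouts are
hypothetical helpers, not part of any existing tool. The exact wording of
the message may differ between kernel versions.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern derived from the message quoted in this report (an assumption,
# not an official format definition):
#   NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


class TxTimeout(NamedTuple):
    """One watchdog hit: interface name, driver name, TX queue index."""
    iface: str
    driver: str
    queue: int


def find_tx_timeouts(lines: Iterable[str], driver: str = "ice") -> Iterator[TxTimeout]:
    """Yield a TxTimeout for every watchdog line reported by `driver`."""
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            yield TxTimeout(
                match.group("iface"),
                match.group("driver"),
                int(match.group("queue")),
            )


# Example run against the message from this report plus an unrelated line.
sample_log = [
    "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out",
    "some unrelated kernel message",
]
for hit in find_tx_timeouts(sample_log):
    print(f"{hit.iface}: TX queue {hit.queue} timed out (driver {hit.driver})")
```

On a real node the input lines could come from `journalctl -k` or `dmesg`
output instead of the sample list.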
---
ProblemType: Bug
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Sep 12 20:05 seq
crw-rw---- 1 root audio 116, 33 Sep 12 20:05 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq',
'/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
CasperMD5CheckResult: pass
CloudArchitecture: x86_64
CloudID: none
CloudName: none
CloudPlatform: none
CloudSubPlatform: config
DistroRelease: Ubuntu 22.04
InstallationDate: Installed on 2023-08-22 (24 days ago)
InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release
amd64 (20230810)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: Dell Inc. PowerEdge R7515
Package: linux (not installed)
PciMultimedia:
ProcFB: 0 mgag200drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic
root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
RelatedPackageVersions:
linux-restricted-modules-5.15.0-83-generic N/A
linux-backports-modules-5.15.0-83-generic N/A
linux-firmware 20220329.git681281e4-0ubuntu3.18
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: jammy uec-images
Uname: Linux 5.15.0-83-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 07/27/2023
dmi.bios.release: 2.12
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.12.4
dmi.board.name: 0J91V2
dmi.board.vendor: Dell Inc.
dmi.board.version: A01
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias:
dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
dmi.product.family: PowerEdge
dmi.product.name: PowerEdge R7515
dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
dmi.sys.vendor: Dell Inc.
---
ProblemType: Bug
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Sep 15 03:13 seq
crw-rw---- 1 root audio 116, 33 Sep 15 03:13 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse:
Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with
exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied
Cannot stat file /proc/323635/fd/10: Permission denied
CRDA: N/A
CasperMD5CheckResult: unknown
CloudArchitecture: x86_64
CloudID: maas
CloudName: maas
CloudPlatform: maas
CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/)
DistroRelease: Ubuntu 22.04
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: Dell Inc. PowerEdge R7525
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
Package: linux (not installed)
PciMultimedia:
ProcFB: 0 mgag200drmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.2.0-32-generic
root=UUID=9b437790-e6e2-4a2e-af79-5b13fee932af ro
ProcVersionSignature: Ubuntu 6.2.0-32.32~22.04.1-generic 6.2.16
RebootRequiredPkgs: Error: path contained symlinks.
RelatedPackageVersions:
linux-restricted-modules-6.2.0-32-generic N/A
linux-backports-modules-6.2.0-32-generic N/A
linux-firmware 20220329.git681281e4-0ubuntu3.18
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: jammy uec-images
Uname: Linux 6.2.0-32-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 07/26/2023
dmi.bios.release: 2.12
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.12.4
dmi.board.name: 03WYW4
dmi.board.vendor: Dell Inc.
dmi.board.version: A02
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias:
dmi:bvnDellInc.:bvr2.12.4:bd07/26/2023:br2.12:svnDellInc.:pnPowerEdgeR7525:pvr:rvnDellInc.:rn03WYW4:rvrA02:cvnDellInc.:ct23:cvr:skuSKU=08FF;ModelName=PowerEdgeR7525:
dmi.product.family: PowerEdge
dmi.product.name: PowerEdge R7525
dmi.product.sku: SKU=08FF;ModelName=PowerEdge R7525
dmi.sys.vendor: Dell Inc.
mtime.conffile..etc.logrotate.d.apport: 2023-09-15T13:17:01.203771