Hi, thanks for your efforts on this issue; however, we're still experiencing problems with the newest kernel. Sorry about missing the patch-testing window, we should have been there for you :)
After only 20 minutes of runtime with the new kernel we saw the following, and networking is basically useless:

[    2.410644] i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 1.4.25-k
[    2.419791] i40e: Copyright (c) 2013 - 2014 Intel Corporation.
[    2.483362] i40e 0000:02:00.0: fw 5.40.47690 api 1.5 nvm 5.40 0x80002d35 18.0.16
[    2.896678] i40e 0000:02:00.0: MAC address: 3c:fd:fe:1a:b5:e0
[    2.903768] i40e 0000:02:00.0: SAN MAC: 3c:fd:fe:1a:b5:e1
[    3.189818] i40e 0000:02:00.0: PCI-Express: Speed 8.0GT/s Width x4
[    3.193934] i40e 0000:02:00.0: PCI-Express bandwidth available for this device may be insufficient for optimal performance.
[    3.202198] i40e 0000:02:00.0: Please move the device to a different PCI-e link with more lanes and/or higher transfer rate.
[    3.241095] i40e 0000:02:00.0: Features: PF-id[0] VFs: 64 VSIs: 2 QP: 4 RX: 1BUF RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
[    3.279202] i40e 0000:02:00.1: fw 5.40.47690 api 1.5 nvm 5.40 0x80002d35 18.0.16
[    3.531346] i40e 0000:02:00.1: MAC address: 3c:fd:fe:1a:b5:e2
[    3.539557] i40e 0000:02:00.1: SAN MAC: 3c:fd:fe:1a:b5:e3
[    3.761719] i40e 0000:02:00.1: PCI-Express: Speed 8.0GT/s Width x4
[    3.765721] i40e 0000:02:00.1: PCI-Express bandwidth available for this device may be insufficient for optimal performance.
[    3.773539] i40e 0000:02:00.1: Please move the device to a different PCI-e link with more lanes and/or higher transfer rate.
[    3.812022] i40e 0000:02:00.1: Features: PF-id[1] VFs: 64 VSIs: 2 QP: 4 RX: 1BUF RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
[    3.855168] i40e 0000:02:00.0 p1p1: renamed from eth2
[    3.895278] i40e 0000:02:00.1 p1p2: renamed from eth0
[    7.205832] i40e 0000:02:00.1 p1p2: already using mac address 3c:fd:fe:1a:b5:e2
[    7.208378] i40e 0000:02:00.1 p1p2: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[    7.208401] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e2 vid=0
[    7.208453] i40e 0000:02:00.0 p1p1: set new mac address 3c:fd:fe:1a:b5:e2
[    7.217191] i40e 0000:02:00.0 p1p1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[    7.217215] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e2 vid=0
[    7.240919] i40e 0000:02:00.1 p1p2: set new mac address 3c:fd:fe:1a:b5:e0
[    7.252720] i40e 0000:02:00.0 p1p1: returning to hw mac address 3c:fd:fe:1a:b5:e0
[    7.324791] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[    7.324798] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1109.574733] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[ 1110.011152] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1110.011155] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1110.013749] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[ 1110.013773] i40e 0000:02:00.1 p1p2: speed changed to 0 for port p1p2
[ 1110.013954] bond0: link status up again after 0 ms for interface p1p2
[ 1110.983823] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1110.983825] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1110.985836] bond0: link status up again after 0 ms for interface p1p2
[ 1111.432231] i40e 0000:02:00.0: TX driver issue detected, PF reset issued
[ 1111.981828] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1111.981835] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1111.984816] i40e 0000:02:00.0: TX driver issue detected, PF reset issued
[ 1111.987007] bond0: link status up again after 0 ms for interface p1p1
[ 1112.981796] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1112.981803] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1112.985812] bond0: link status up again after 0 ms for interface p1p1
[ 1114.204548] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[ 1114.983686] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1114.983688] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1114.985692] bond0: link status up again after 0 ms for interface p1p2
[ 1115.752686] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[ 1116.985619] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1116.985624] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1116.988361] i40e 0000:02:00.1 p1p2: speed changed to 0 for port p1p2
[ 1116.989607] bond0: link status up again after 0 ms for interface p1p2

# uname -a
Linux lb05 4.4.0-97-generic #120-Ubuntu SMP Tue Sep 19 17:28:18 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

# modinfo i40e
filename:       /lib/modules/4.4.0-97-generic/kernel/drivers/net/ethernet/intel/i40e/i40e.ko
version:        1.4.25-k

As a workaround we're using i40e driver v2.0.30 via DKMS, which has worked fine without any issues so far, but it would be nice to have this problem fixed properly :-)

If we're going about this in the wrong way and our problem is not applicable to this fix, please let us know. We're happy to test new patches if there are any. We're going to test the HWE 4.10 kernel mentioned and see how that behaves.

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
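For reference, the DKMS workaround mentioned above followed roughly the steps below. This is a hedged sketch, not a supported procedure: the tarball name, the /usr/src path, and the presence of a dkms.conf in the source tree are assumptions about the out-of-tree i40e release, and reloading the module drops the links briefly.

```shell
# Sketch of installing the out-of-tree i40e 2.0.30 driver via DKMS.
# Assumes the i40e-2.0.30 source tarball (with a dkms.conf) has been
# obtained from Intel and that kernel headers for the running kernel
# are installed.
tar xzf i40e-2.0.30.tar.gz -C /usr/src       # unpack to /usr/src/i40e-2.0.30
sudo dkms add -m i40e -v 2.0.30              # register the source tree with DKMS
sudo dkms build -m i40e -v 2.0.30            # build against the running kernel
sudo dkms install -m i40e -v 2.0.30          # install the module and run depmod
sudo modprobe -r i40e && sudo modprobe i40e  # reload the driver (links go down briefly)
modinfo -F version i40e                      # verify the new version is picked up
```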
https://bugs.launchpad.net/bugs/1713553

Title:
  Intel i40e PF reset due to incorrect MDD detection

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Released

Bug description:
  [Impact]

  Using an Intel i40e network device under heavy traffic load with TSO enabled, the device will spontaneously reset itself and issue errors similar to the following:

  Jun 14 14:09:51 hostname kernel: [4253913.851053] i40e 0000:05:00.1: TX driver issue detected, PF reset issued
  Jun 14 14:09:53 hostname kernel: [4253915.476283] i40e 0000:05:00.1: TX driver issue detected, PF reset issued
  Jun 14 14:09:54 hostname kernel: [4253917.411264] i40e 0000:05:00.1: TX driver issue detected, PF reset issued

  This causes a full reset of the PF, which interrupts traffic flow.

  This was partially fixed by Xenial commit 12f8cc59d5886b86372f45290166deca57a60d7a; however, one additional upstream commit is required to fully fix the issue:

  commit 841493a3f64395b60554afbcaa17f4350f90e764
  Author: Alexander Duyck <alexander.h.du...@intel.com>
  Date:   Tue Sep 6 18:05:04 2016 -0700

      i40e: Limit TX descriptor count in cases where frag size is greater than 16K

  This fix was never backported into the Xenial 4.4 kernel series, but is already present in the Xenial HWE (and Zesty) 4.10 kernel.

  [Testcase]

  In this case the issue occurs at a customer site using i40e-based Intel network cards with SR-IOV enabled. Under heavy load, the card resets itself as described.

  [Regression Potential]

  As with any change to a network card driver, this may cause regressions with network I/O through i40e card(s). However, this specific change only increases the likelihood that any given large TSO transmit will need to be linearized, which avoids the PF reset. Linearizing a TSO transmit that did not need to be linearized will not cause any failures; it may only decrease performance slightly.
  This patch should only cause linearization when required to avoid the MDD detection and PF reset.

  [Other Info]

  The previous bug for this issue is bug 1700834.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1713553/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
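[Editor's note, not part of the original report] Since the spurious MDD detections described above are triggered by large TSO transmits, one hedged stopgap for anyone who cannot yet run a kernel with commit 841493a3f643 is to disable TSO on the affected ports. The interface names below are examples from the report; this trades some CPU for stability and is a mitigation sketch, not a fix:

```shell
# Check whether TSO is currently enabled on a port
ethtool -k p1p1 | grep tcp-segmentation-offload

# Disable TSO on each affected i40e port (survives until reboot/reload)
sudo ethtool -K p1p1 tso off
sudo ethtool -K p1p2 tso off
```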