Hi,

Thanks for your efforts on this issue; however, we're still
experiencing problems with the newest kernel. Sorry about missing the
patch-testing window, we should have been there for you :)

After only 20 minutes of runtime on the new kernel, we saw the
following, and networking became basically unusable:

[    2.410644] i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 1.4.25-k
[    2.419791] i40e: Copyright (c) 2013 - 2014 Intel Corporation.
[    2.483362] i40e 0000:02:00.0: fw 5.40.47690 api 1.5 nvm 5.40 0x80002d35 18.0.16
[    2.896678] i40e 0000:02:00.0: MAC address: 3c:fd:fe:1a:b5:e0
[    2.903768] i40e 0000:02:00.0: SAN MAC: 3c:fd:fe:1a:b5:e1
[    3.189818] i40e 0000:02:00.0: PCI-Express: Speed 8.0GT/s Width x4
[    3.193934] i40e 0000:02:00.0: PCI-Express bandwidth available for this device may be insufficient for optimal performance.
[    3.202198] i40e 0000:02:00.0: Please move the device to a different PCI-e link with more lanes and/or higher transfer rate.
[    3.241095] i40e 0000:02:00.0: Features: PF-id[0] VFs: 64 VSIs: 2 QP: 4 RX: 1BUF RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
[    3.279202] i40e 0000:02:00.1: fw 5.40.47690 api 1.5 nvm 5.40 0x80002d35 18.0.16
[    3.531346] i40e 0000:02:00.1: MAC address: 3c:fd:fe:1a:b5:e2
[    3.539557] i40e 0000:02:00.1: SAN MAC: 3c:fd:fe:1a:b5:e3
[    3.761719] i40e 0000:02:00.1: PCI-Express: Speed 8.0GT/s Width x4
[    3.765721] i40e 0000:02:00.1: PCI-Express bandwidth available for this device may be insufficient for optimal performance.
[    3.773539] i40e 0000:02:00.1: Please move the device to a different PCI-e link with more lanes and/or higher transfer rate.
[    3.812022] i40e 0000:02:00.1: Features: PF-id[1] VFs: 64 VSIs: 2 QP: 4 RX: 1BUF RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
[    3.855168] i40e 0000:02:00.0 p1p1: renamed from eth2
[    3.895278] i40e 0000:02:00.1 p1p2: renamed from eth0
[    7.205832] i40e 0000:02:00.1 p1p2: already using mac address 3c:fd:fe:1a:b5:e2
[    7.208378] i40e 0000:02:00.1 p1p2: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[    7.208401] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e2 vid=0
[    7.208453] i40e 0000:02:00.0 p1p1: set new mac address 3c:fd:fe:1a:b5:e2
[    7.217191] i40e 0000:02:00.0 p1p1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[    7.217215] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e2 vid=0
[    7.240919] i40e 0000:02:00.1 p1p2: set new mac address 3c:fd:fe:1a:b5:e0
[    7.252720] i40e 0000:02:00.0 p1p1: returning to hw mac address 3c:fd:fe:1a:b5:e0
[    7.324791] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[    7.324798] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1109.574733] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[ 1110.011152] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1110.011155] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1110.013749] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[ 1110.013773] i40e 0000:02:00.1 p1p2: speed changed to 0 for port p1p2
[ 1110.013954] bond0: link status up again after 0 ms for interface p1p2
[ 1110.983823] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1110.983825] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1110.985836] bond0: link status up again after 0 ms for interface p1p2
[ 1111.432231] i40e 0000:02:00.0: TX driver issue detected, PF reset issued
[ 1111.981828] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1111.981835] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1111.984816] i40e 0000:02:00.0: TX driver issue detected, PF reset issued
[ 1111.987007] bond0: link status up again after 0 ms for interface p1p1
[ 1112.981796] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1112.981803] i40e 0000:02:00.0 p1p1: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1112.985812] bond0: link status up again after 0 ms for interface p1p1
[ 1114.204548] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[ 1114.983686] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1114.983688] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1114.985692] bond0: link status up again after 0 ms for interface p1p2
[ 1115.752686] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
[ 1116.985619] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=0
[ 1116.985624] i40e 0000:02:00.1 p1p2: adding 3c:fd:fe:1a:b5:e0 vid=5
[ 1116.988361] i40e 0000:02:00.1 p1p2: speed changed to 0 for port p1p2
[ 1116.989607] bond0: link status up again after 0 ms for interface p1p2

# uname -a
Linux lb05 4.4.0-97-generic #120-Ubuntu SMP Tue Sep 19 17:28:18 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

# modinfo i40e
filename:       /lib/modules/4.4.0-97-generic/kernel/drivers/net/ethernet/intel/i40e/i40e.ko
version:        1.4.25-k

As a workaround we're using the i40e driver v2.0.30 via dkms, which has
been working fine without any issues so far, but it would be nice to
have this problem fixed properly :-)

If we're going about this the wrong way and this fix doesn't apply to
our problem, please let us know. We're happy to test new patches if
there are any.

We're going to test the HWE 4.10 kernel mentioned and see how it behaves.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1713553

Title:
  Intel i40e PF reset due to incorrect MDD detection

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Released

Bug description:
  [Impact]

  Under heavy traffic load with TSO enabled, an Intel i40e network
  device will spontaneously reset itself and issue errors similar to
  the following:

  Jun 14 14:09:51 hostname kernel: [4253913.851053] i40e 0000:05:00.1: TX driver issue detected, PF reset issued
  Jun 14 14:09:53 hostname kernel: [4253915.476283] i40e 0000:05:00.1: TX driver issue detected, PF reset issued
  Jun 14 14:09:54 hostname kernel: [4253917.411264] i40e 0000:05:00.1: TX driver issue detected, PF reset issued

   This causes a full reset of the PF, interrupting traffic flow.

  This was partially fixed by Xenial commit
  12f8cc59d5886b86372f45290166deca57a60d7a; however, one
  additional upstream commit is required to fully fix the issue:

  commit 841493a3f64395b60554afbcaa17f4350f90e764
  Author: Alexander Duyck <alexander.h.du...@intel.com>
  Date:   Tue Sep 6 18:05:04 2016 -0700

      i40e: Limit TX descriptor count in cases where frag size is greater than 16K

   This fix was never backported into the Xenial 4.4 kernel series, but
  is already present in the Xenial HWE (and Zesty) 4.10 kernel.

  [Testcase]

   In this case, the issue occurs at a customer site using i40e-based
  Intel network cards with SR-IOV enabled. Under heavy load, the card will
  reset itself as described.

  [Regression Potential]

  As with any change to a network card driver, this may cause
  regressions with network I/O through i40e card(s).  However, this
  specific change only increases the likelihood that any given large
  TSO transmit will need to be linearized, which avoids the PF reset.
  Linearizing a TSO transmit that did not need to be linearized will not
  cause any failures; it may only decrease performance slightly.  This
  patch should only cause linearization when required to avoid the MDD
  detection and PF reset.
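
  To make the descriptor-count reasoning above concrete, here is a
  rough C sketch of the kind of check that commit adds. All names and
  constants below are illustrative, not the driver's actual code (the
  real logic lives in __i40e_chk_linearize and walks the fragments per
  MSS window); the idea is simply that a fragment larger than 16KB
  consumes more than one descriptor, and packets that would exceed the
  per-segment descriptor budget must be linearized first:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants mirroring the limits described above:
 * at most 8 data descriptors per TSO segment, each mapping
 * roughly 16KB of data. */
#define MAX_DESC_PER_SEG   8
#define MAX_DATA_PER_DESC  (16 * 1024)

/* A fragment larger than 16KB spills across multiple descriptors;
 * accounting for that spill is the point of the fix. */
static size_t descs_for_frag(size_t frag_len)
{
    return (frag_len + MAX_DATA_PER_DESC - 1) / MAX_DATA_PER_DESC;
}

/* Returns 1 when the fragments would exceed the per-segment
 * descriptor budget, i.e. the packet should be linearized rather
 * than letting the hardware trip MDD and reset the PF. */
static int needs_linearize(const size_t *frag_lens, size_t nfrags)
{
    size_t descs = 0;
    for (size_t i = 0; i < nfrags; i++)
        descs += descs_for_frag(frag_lens[i]);
    return descs > MAX_DESC_PER_SEG;
}
```

  Before the fix, a check along these lines counted fragments rather
  than descriptors, so a handful of >16KB fragments could slip past the
  limit and trigger the MDD event.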

  [Other Info]

  The previous bug for this issue is bug 1700834.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1713553/+subscriptions
