** Description changed:

[Impact]

The i40e driver sometimes triggers a Malicious Driver Detection (MDD) event in the firmware, which then resets the NIC. The reset interrupts network traffic at least temporarily, and it can cause further problems, e.g. when the interface is part of a bond.

[Fix]

MDD events issued for the PF are usually the result of a misconfigured TX descriptor, not of "bad" actions by the VFs. There is no need to reset the whole NIC for them; the TX hang checks should handle such cases if necessary.

[Test Case]

The bug is unfortunately difficult to reproduce, as there is no detailed documentation on how the i40e firmware detects and raises MDD events. We have seen reports of it in Xenial and Bionic, for workloads stressing i40e bonds in LACP mode. A reproduction is easy to detect: network traffic is interrupted and the system logs contain a message like:

  i40e 0000:02:00.1: TX driver issue detected, PF reset issued

[Regression Potential]

- Since we're removing resets for the NIC, regressions could show up as connectivity issues after MDD events are raised. If the firmware expects the whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in networking.
+ Since we're removing resets for the NIC, regressions could show up as connectivity issues after MDD events are raised. If the firmware expects the whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in networking. The potential for this should, however, be fairly low, as this patch has been present upstream since kernel 5.2 and has not needed any follow-up fixes or caused any reported regressions there. Basic smoke tests also showed that the driver continues to work as expected.
== [original description]

This is a continuation of bug 1713553 and bug 1723127. Patches were added in both bugs to attempt to fix this issue. They may have reduced how often it occurs, but based on further reports they do not appear to have fixed it. See bug 1713553 and bug 1723127 for more details.
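As a rough illustration of the [Fix] above, this C sketch contrasts the reset decision for a PF MDD event before and after the change. It is a simplified model, not the driver's code. The enum and function names (`mdd_origin`, `mdd_action`, `mdd_action_before`, `mdd_action_after`, `mdd_interrupts_pf_traffic`) are hypothetical. VF-originated events are deliberately left unmodeled, because the description only discusses PF events.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the PF reset policy described in [Fix].
 * All names below are illustrative; they are NOT identifiers from the
 * i40e driver, and the real MDD handler does considerably more.
 */
enum mdd_origin {
	MDD_ORIGIN_PF, /* event attributed to the PF's own TX path */
	MDD_ORIGIN_VF  /* event attributed to a VF */
};

enum mdd_action {
	MDD_ACTION_PF_RESET,               /* reset the whole PF (interrupts traffic) */
	MDD_ACTION_DEFER_TO_TX_HANG_CHECK, /* no reset; TX hang checks recover if needed */
	MDD_ACTION_VF_PATH                 /* VF handling: not modeled, same in both versions */
};

/* Before the patch: every PF MDD event requests a PF reset. */
enum mdd_action mdd_action_before(enum mdd_origin origin)
{
	return origin == MDD_ORIGIN_PF ? MDD_ACTION_PF_RESET
				       : MDD_ACTION_VF_PATH;
}

/*
 * After the patch: a PF MDD event is assumed to come from a misconfigured
 * TX descriptor, so no reset is issued here; a queue that is genuinely
 * stuck is left to the TX hang checks.
 */
enum mdd_action mdd_action_after(enum mdd_origin origin)
{
	return origin == MDD_ORIGIN_PF ? MDD_ACTION_DEFER_TO_TX_HANG_CHECK
				       : MDD_ACTION_VF_PATH;
}

/* True when the chosen action resets the PF and so interrupts its traffic. */
bool mdd_interrupts_pf_traffic(enum mdd_action action)
{
	return action == MDD_ACTION_PF_RESET;
}
```

Both functions return the same value for VF events. This reflects that the change, as described, only stops the reset for PF events. The regression risk discussed above corresponds to the case where `MDD_ACTION_DEFER_TO_TX_HANG_CHECK` turns out not to be enough to recover.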
--
https://bugs.launchpad.net/bugs/1772675

Title:
  i40e PF reset due to incorrect MDD event

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Cosmic:
  Won't Fix