SRU request submitted: https://lists.ubuntu.com/archives/kernel-team/2018-October/096376.html
** Also affects: linux (Ubuntu Cosmic)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Cosmic)
       Status: New => In Progress

** Changed in: linux (Ubuntu Cosmic)
   Importance: Undecided => Critical

** Changed in: linux (Ubuntu Cosmic)
     Assignee: (unassigned) => Joseph Salisbury (jsalisbury)

** Description changed:

+
+ == SRU Justification ==
+ The requested commit fixes a regression introduced by mainline commit
+ 3a2f70331226, in v4.18-rc1. The commit is only needed in Cosmic. Due to
+ the regression, a Mellanox CX5 stops pinging with rx_wqe_err (mlx5_core).
+
+ == Fix ==
+ 37fdffb217a4 ("net/mlx5: WQ, fixes for fragmented WQ buffers API")
+
+ == Regression Potential ==
+ Low. This commit has been cc'd to stable, so it has had additional
+ upstream review.
+
+ == Test Case ==
+ A test kernel was built with this patch and tested by the original bug reporter.
+ The bug reporter states the test kernel resolved the bug.
+
+
+ == Comment: #0 - Michael Ranweiler - 2018-10-18 11:34:40 ==
+ ---Problem Description---
+ At the system if you do
+ ethtool -S enP48p1s0f0 | grep wqe_err
+      rx_wqe_err: 1
+      rx0_wqe_err: 0
+      rx1_wqe_err: 0
+      rx2_wqe_err: 0
+      rx3_wqe_err: 1
+      rx4_wqe_err: 0
+      rx5_wqe_err: 0
+      rx6_wqe_err: 0
+      rx7_wqe_err: 0
+      rx8_wqe_err: 0
+      rx9_wqe_err: 0
+      rx10_wqe_err: 0
+      rx11_wqe_err: 0
+      rx12_wqe_err: 0
+      rx13_wqe_err: 0
+      rx14_wqe_err: 0
+      rx15_wqe_err: 0
- ---Problem Description---
- At the system if you do
- ethtool -S enP48p1s0f0 | grep wqe_err
-      rx_wqe_err: 1
-      rx0_wqe_err: 0
-      rx1_wqe_err: 0
-      rx2_wqe_err: 0
-      rx3_wqe_err: 1
-      rx4_wqe_err: 0
-      rx5_wqe_err: 0
-      rx6_wqe_err: 0
-      rx7_wqe_err: 0
-      rx8_wqe_err: 0
-      rx9_wqe_err: 0
-      rx10_wqe_err: 0
-      rx11_wqe_err: 0
-      rx12_wqe_err: 0
-      rx13_wqe_err: 0
-      rx14_wqe_err: 0
-      rx15_wqe_err: 0
+ You will see that the rx side is hitting an issue.
- You will see that the rx side is hitting an issue.
-
-
  ---Additional Hardware Info---
  Mellanox CX5 Ethernet 100G
  lspci
  0030:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
  0030:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
-
-
- Machine Type = P9
-
+
+ Machine Type = P9
+
  ---Debugger---
  A debugger is not configured
-
+
  ---Steps to Reproduce---
- Using a CX5 Ethernet 100G card
+ Using a CX5 Ethernet 100G card
  lspci
  0030:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
  0030:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
- just configure IP
+ just configure IP
  ifconfig enP48p1s0f0 33.33.33.33 netmask 255.255.255.0 up

  then partner system configure IP and then try ping -f
  ping -f 33.33.33.33
  PING 33.33.33.33 (33.33.33.33) 56(84) bytes of data.
  ........................................^C
  --- 33.33.33.33 ping statistics ---
  5413 packets transmitted, 5373 received, 0% packet loss, time 934ms
  rtt min/avg/max/mdev = 0.015/0.019/0.669/0.010 ms, ipg/ewma 0.172/0.020 ms

  # ping 33.33.33.33
  PING 33.33.33.33 (33.33.33.33) 56(84) bytes of data.
  ^C
  --- 33.33.33.33 ping statistics ---
  2 packets transmitted, 0 received, 100% packet loss, time 1071ms

- then at the recv system then do
+ then at the recv system then do
  ethtool -S enP48p1s0f0 | grep wqe_err
-      rx_wqe_err: 1
-      rx0_wqe_err: 0
-      rx1_wqe_err: 0
-      rx2_wqe_err: 0
-      rx3_wqe_err: 1
-      rx4_wqe_err: 0
-      rx5_wqe_err: 0
-      rx6_wqe_err: 0
-      rx7_wqe_err: 0
-      rx8_wqe_err: 0
-      rx9_wqe_err: 0
-      rx10_wqe_err: 0
-      rx11_wqe_err: 0
-      rx12_wqe_err: 0
-      rx13_wqe_err: 0
-      rx14_wqe_err: 0
-      rx15_wqe_err: 0
- you will see rx_wqe_err with a counter non-zero.
+      rx_wqe_err: 1
+      rx0_wqe_err: 0
+      rx1_wqe_err: 0
+      rx2_wqe_err: 0
+      rx3_wqe_err: 1
+      rx4_wqe_err: 0
+      rx5_wqe_err: 0
+      rx6_wqe_err: 0
+      rx7_wqe_err: 0
+      rx8_wqe_err: 0
+      rx9_wqe_err: 0
+      rx10_wqe_err: 0
+      rx11_wqe_err: 0
+      rx12_wqe_err: 0
+      rx13_wqe_err: 0
+      rx14_wqe_err: 0
+      rx15_wqe_err: 0
+ you will see rx_wqe_err with a counter non-zero.

  This is fixed by this patch:
  https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=37fdffb217a45609edccbb8b407d031143f551c0

  == Comment: #1 - Carol L. Soto - 2018-10-18 11:46:00 ==
- I did a git clone to the cosmic tree and loaded the kernel in a system.
+ I did a git clone to the cosmic tree and loaded the kernel in a system.
  kernel 4.18.12 and I can recreate it.

  lspci | grep Mell | grep ConnectX-5
  0000:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
  0000:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
  0030:01:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
  0030:01:00.1 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

  :~# ethtool -S enp1s0f0 | grep wqe_err
-      rx_wqe_err: 2
-      rx0_wqe_err: 1
-      rx1_wqe_err: 1
-      rx2_wqe_err: 0
-      rx3_wqe_err: 0
-      rx4_wqe_err: 0
-      rx5_wqe_err: 0
-      rx6_wqe_err: 0
-      rx7_wqe_err: 0
-      rx8_wqe_err: 0
-      rx9_wqe_err: 0
-      rx10_wqe_err: 0
+      rx_wqe_err: 2
+      rx0_wqe_err: 1
+      rx1_wqe_err: 1
+      rx2_wqe_err: 0
+      rx3_wqe_err: 0
+      rx4_wqe_err: 0
+      rx5_wqe_err: 0
+      rx6_wqe_err: 0
+      rx7_wqe_err: 0
+      rx8_wqe_err: 0
+      rx9_wqe_err: 0
+      rx10_wqe_err: 0
  ...
-
  Let me check if the proposed patch needs backport or not.

  == Comment: #3 - Carol L. Soto - 2018-10-18 13:34:46 ==
- I was able to apply the proposed patch as is to the cosmic git tree and no issue. (no need to backport)
- using a kernel 4.18.12+.
+ I was able to apply the proposed patch as is to the cosmic git tree and no issue. (no need to backport)
+ using a kernel 4.18.12+.
- With the proposed patch I do not see wqe err and ping does not stop.
+ With the proposed patch I do not see wqe err and ping does not stop.
  ethtool -S enp1s0f0 | grep wqe_err
-      rx_wqe_err: 0
-      rx0_wqe_err: 0
-      rx1_wqe_err: 0
-      rx2_wqe_err: 0
-      rx3_wqe_err: 0
-      rx4_wqe_err: 0
-      rx5_wqe_err: 0
-      rx6_wqe_err: 0
-      rx7_wqe_err: 0
-      rx8_wqe_err: 0
-      rx9_wqe_err: 0
-      rx10_wqe_err: 0
+      rx_wqe_err: 0
+      rx0_wqe_err: 0
+      rx1_wqe_err: 0
+      rx2_wqe_err: 0
+      rx3_wqe_err: 0
+      rx4_wqe_err: 0
+      rx5_wqe_err: 0
+      rx6_wqe_err: 0
+      rx7_wqe_err: 0
+      rx8_wqe_err: 0
+      rx9_wqe_err: 0
+      rx10_wqe_err: 0
  ...

-- 
https://bugs.launchpad.net/bugs/1799393

Title:
  Mellanox CX5 stops pinging with rx_wqe_err (mlx5_core)
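
The test case in this report comes down to checking whether any rx*_wqe_err counter in the `ethtool -S <iface> | grep wqe_err` output is non-zero. A minimal sketch of that check follows, assuming the output is captured as text. The helper names (parse_wqe_err, nonzero_counters) are hypothetical and not part of ethtool or mlx5_core; the sample values are copied from Comment #1 and Comment #3 above, truncated to the first few counters.

```python
import re

# Matches lines such as "     rx3_wqe_err: 1" as printed by `ethtool -S`.
_WQE_ERR_LINE = re.compile(r"^\s*(rx\d*_wqe_err):\s*(\d+)\s*$")


def parse_wqe_err(ethtool_output):
    """Return {counter_name: value} for each rx*_wqe_err line in the text."""
    counters = {}
    for line in ethtool_output.splitlines():
        match = _WQE_ERR_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def nonzero_counters(counters):
    """Keep only the counters that recorded at least one error."""
    return {name: value for name, value in counters.items() if value != 0}


# Sample text taken from the report (unpatched 4.18.12 kernel, Comment #1).
BEFORE_PATCH = """\
     rx_wqe_err: 2
     rx0_wqe_err: 1
     rx1_wqe_err: 1
     rx2_wqe_err: 0
"""

# Sample text taken from the report (patched kernel, Comment #3).
AFTER_PATCH = """\
     rx_wqe_err: 0
     rx0_wqe_err: 0
     rx1_wqe_err: 0
     rx2_wqe_err: 0
"""

for label, text in (("before", BEFORE_PATCH), ("after", AFTER_PATCH)):
    bad = nonzero_counters(parse_wqe_err(text))
    print(label, "FAIL" if bad else "OK", bad)
```

On a live system the text could be obtained by running the ethtool command from the report and passing its output to parse_wqe_err; an empty result from nonzero_counters corresponds to the "no wqe err" state described for the patched kernel.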