Okay, that's good - thanks for that.
Closing this ticket as invalid.
** Changed in: linux (Ubuntu Focal)
Status: New => Invalid
** Changed in: ubuntu-z-systems
Status: Incomplete => Opinion
** Changed in: ubuntu-z-systems
Status: Opinion => Invalid
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1990275
Title:
[UBUNTU 20.04] Unexpected LAG affinity behaviour with mlx5_core
driver in Ubuntu 20.04
Status in Ubuntu on IBM z Systems:
Invalid
Status in linux package in Ubuntu:
Invalid
Status in linux source package in Focal:
Invalid
Bug description:
== Comment: #0 - KISHORE KUMAR G <[email protected]> - 2022-09-19
04:39:42 ==
---Problem Description---
On an Ubuntu/s390 system that houses a Mellanox CX5 adapter, the two adapter
ports are connected to a pair of TOR switches. The system acts as the entry
point (edge node) through which a cluster of compute nodes accesses the public
network. The Mellanox firmware level is as follows:
ethtool -i p0
driver: mlx5e_rep
version: 5.4.0-104.118-
firmware-version: 16.27.1016 (MT_0000000013)
expansion-rom-version:
bus-info: 0100:00:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
The LAG affinity module of mlx5_core in the upstream 5.4 kernel listens to
routing events and sets the LAG affinity accordingly, whereas one of our
custom services, the Fabcon service, listens to the routing events
and sets the LAG affinity in the Mellanox driver accordingly.
The edge node routes defined in the compute nodes use both interfaces
(port 1 = p0 and port 2 = p1) for the LAG affinity. For instance:
10.66.0.170 proto bgp src 10.66.11.43 metric 20
nexthop via 172.31.22.42 dev p0 weight 1
nexthop via 172.31.22.170 dev p1 weight 1
For example, after an edge node boots, the LAG mapping converges to use
both port 1 (p0) and port 2 (p1) by default:
root@pok1-qz1-sr1-rk011-s20:/# dmesg | grep lag
[ 282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2
[ 282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2
(<------ Both ports are equally mapped)
The issue arises when the mlx5_core driver cannot derive the LAG
configuration from specific routes. For instance, disabling an
interface on the edge node above (10.66.0.170), or adding/removing
an interface, causes the mlx5_core driver to react to the routing
change and switch the LAG affinity to use a single network
interface only.
In the following example, a new static route entry to a single
destination (10.66.47.34) is added:
ip route add 10.66.47.34 proto static src 10.66.11.43 metric 20 via 172.31.22.42 dev p0
This caused the LAG mapping to change to port 1 (p0), as shown below:
root@pok1-qz1-sr1-rk011-s20:/# dmesg | grep lag
[ 282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2
[ 282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2
[ 757.878626] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:1
<----mapping directs to go thru P0.
This behaviour causes all the traffic in 10.x to use a single network
interface.
The TOR switches (fabric) are not aware of such a LAG affinity change, and
therefore packets are dropped on the "not in use" interface
(e.g. port 2 (p1)).
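
The mapping change described above can be spotted mechanically in the kernel
log. The following is a minimal Python sketch, not part of the original
report, that parses "lag map" lines in the format shown in the dmesg output
above and flags entries where both ports point at the same target. The
function names and the reading of "port 1:X port 2:Y" as "port 1 mapped to X,
port 2 mapped to Y" are assumptions based on the annotations in this report.
Note that the initial "lag map port 1:2 port 2:2" line at boot also counts as
collapsed under this definition, so only the latest entry is meaningful.

```python
import re

# Matches both the initial "lag map" line and later "modify lag map" lines,
# in the format quoted in this bug report (each entry on a single line).
LAG_MAP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+(?P<dev>\S+):\s+"
    r"(?:modify\s+)?lag map port 1:(?P<p1>\d+) port 2:(?P<p2>\d+)"
)


def parse_lag_maps(dmesg_text):
    """Return (timestamp, device, port1_target, port2_target) per lag map line."""
    entries = []
    for line in dmesg_text.splitlines():
        m = LAG_MAP_RE.search(line)
        if m:
            entries.append((float(m["ts"]), m["dev"], int(m["p1"]), int(m["p2"])))
    return entries


def is_collapsed(entry):
    """True when both ports are mapped to the same target port."""
    _, _, p1, p2 = entry
    return p1 == p2


# Sample input copied from the dmesg output quoted in this report.
SAMPLE = """\
[ 282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2
[ 282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2
[ 757.878626] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:1
"""

entries = parse_lag_maps(SAMPLE)
latest = entries[-1]
print(latest, "collapsed" if is_collapsed(latest) else "balanced")
```

On a live system the text could come from "dmesg | grep lag" instead of the
SAMPLE string; the sketch only inspects the log and does not touch the driver.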
So the Mellanox driver (mlx5_core) should not change the LAG
mapping/config based on the last route event; rather, it should rely
on the default routes only.
Mellanox agreed to patch this, and the fixes are available in kernels
5.15.29 and 5.15.39 respectively.
The following commits resolve this issue:
1. net/mlx5e: Lag, Only handle events from highest priority multipath entry
(available in upstream kernel 5.15.29)
https://github.com/torvalds/linux/commit/ad11c4f1d8fd1f03639460e425a36f7fd0ea83f5
2. net/mlx5e: Lag, Don't skip fib events on current dst (5.15.29)
https://github.com/torvalds/linux/commit/4a2a664ed87962c4ddb806a84b5c9634820bcf55
3. net/mlx5e: Lag, Fix fib_info pointer assignment (5.15.39)
https://github.com/torvalds/linux/commit/a6589155ec9847918e00e7279b8aa6d4c272bea7
4. net/mlx5e: Lag, Fix use-after-free in fib event handler (5.15.39)
https://github.com/torvalds/linux/commit/27b0420fd959e38e3500e60b637d39dfab065645
The request is to have the above commits backported to the Ubuntu 20.04.x
series, including the Ubuntu 18.04 HWE kernel.
Contact Information = Kishore Kumar G/[email protected]
[email protected]
---Additional Hardware Info---
Mellanox CX5 adapter with firmware-version: 16.27.1016 (MT_0000000013)
---uname output---
Linux version version: 5.4.0-104.118
Machine Type = s390x LPAR
---Debugger---
A debugger is not configured
---Steps to Reproduce---
...
"
default proto bgp src 10.66.11.41 metric 20
nexthop via 172.31.22.40 dev p0 weight 1
nexthop via 172.31.22.168 dev p1 weight 1"
......
172.31.22.40/31 dev p0 proto kernel scope link src 172.31.22.41
172.31.22.168/31 dev p1 proto kernel scope link src 172.31.22.169
..
We also have around 64 SR-IOV devices for VM consumption.
In the above case, the LAG mapping works as expected and uses both
ports (p0 and p1) for traffic, as shown below:
root@pok1-qz1-sr1-rk011-s20:/# dmesg | grep lag
[ 282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2
[ 282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2 <<<---behavior expected
The issue arises when we add a route to a single IP in the underlying
network with a single next hop: we observe that all the traffic is then
shifted to that single next-hop port, as the example below shows.
root@pok1-qz1-sr1-rk011-s20:/# ip route add 10.66.47.34 proto static src
10.66.11.41 metric 20 via 172.31.22.40 dev p0
root@pok1-qz1-sr1-rk011-s20:/# dmesg | grep lag
[ 282.043011] mlx5_core 0100:00:00.0: lag map port 1:2 port 2:2
[ 282.083541] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:2
[ 757.878626] mlx5_core 0100:00:00.0: modify lag map port 1:1 port 2:1 <<<<------- Issue
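
To find routes of the kind added in the reproduction step, the routing table
can be scanned for entries that go through a single gateway on one of the LAG
ports. The following Python sketch is an illustration only and not part of the
original report: it parses plain "ip route" text, and the helper names, the
"p0"/"p1" port list, and the printed form of the static route in the sample
are assumptions. It does not reproduce the driver's own logic, and routes that
share a destination overwrite each other in this simplified version.

```python
def _hop(tokens):
    """Extract (gateway, device) from the tokens of one route or nexthop line."""
    gw = tokens[tokens.index("via") + 1] if "via" in tokens else None
    dev = tokens[tokens.index("dev") + 1] if "dev" in tokens else None
    return gw, dev


def parse_routes(route_text):
    """Map each destination to the list of (gateway, device) next hops."""
    routes = {}
    current = None
    for raw in route_text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        if tokens[0] == "nexthop":
            # Continuation line of a multipath route.
            if current is not None:
                routes[current].append(_hop(tokens))
            continue
        current = tokens[0]
        routes[current] = []
        if "dev" in tokens:
            routes[current].append(_hop(tokens))
    return routes


def single_hop_via_lag(routes, lag_ports=("p0", "p1")):
    """Destinations reached through exactly one gateway on a LAG port."""
    return sorted(
        dst
        for dst, hops in routes.items()
        if len(hops) == 1 and hops[0][0] is not None and hops[0][1] in lag_ports
    )


# Routes taken from this report; the 10.66.47.34 line is the assumed printed
# form of the static route added in the reproduction step.
SAMPLE = """\
default proto bgp src 10.66.11.41 metric 20
\tnexthop via 172.31.22.40 dev p0 weight 1
\tnexthop via 172.31.22.168 dev p1 weight 1
10.66.47.34 via 172.31.22.40 dev p0 proto static src 10.66.11.41 metric 20
172.31.22.40/31 dev p0 proto kernel scope link src 172.31.22.41
172.31.22.168/31 dev p1 proto kernel scope link src 172.31.22.169
"""

routes = parse_routes(SAMPLE)
print(single_hop_via_lag(routes))
```

Directly connected routes (no "via") are deliberately ignored, so only the
static route to 10.66.47.34 is reported for this sample.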
Stack trace output:
no
Oops output:
no
System Dump Info:
The system is not configured to capture a system dump.
*Additional Instructions for Kishore Kumar G/[email protected]
[email protected]:
-Attach sysctl -a output to the bug.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1990275/+subscriptions