On Wed, Dec 23, 2020 at 10:25 AM Jakub Kicinski <k...@kernel.org> wrote: > > On Tue, 22 Dec 2020 10:38:21 -0800 Samuel Mendoza-Jonas wrote: > > On Tue, 2020-12-22 at 06:13 +0000, Joel Stanley wrote: > > > On Sun, 20 Dec 2020 at 12:40, John Wang wrote: > > > > When aggregating ncsi interfaces and dedicated interfaces to bond > > > > interfaces, the ncsi response handler will use the wrong net device > > > > to > > > > find ncsi_dev, so that the ncsi interface will not work properly. > > > > Here, we use the net device registered to packet_type to fix it. > > > > > > > > Fixes: 138635cc27c9 ("net/ncsi: NCSI response packet handler") > > > > Signed-off-by: John Wang <wangzhiqiang...@bytedance.com> > > This sounds like exactly the case for which orig_dev was introduced. > I think you should use the orig_dev argument, rather than pt->dev.
will send a v2 > > Can you test if that works? Yes, it works. > > > > Can you show me how to reproduce this? On g220a, eth1 is the dedicated interface, eth0 is the ncsi interface kernel cfg: CONFIG_BONDING=y cat /etc/systemd/network/00-bmc-bond1.netdev [NetDev] Name=bond1 Description=Bond eth0 and eth1 Kind=bond [Bond] Mode=active-backup cat /etc/systemd/network/00-bmc-eth0.network [Match] Name=eth0 [Network] Bond=bond1 cat /etc/systemd/network/00-bmc-eth0.network [Match] Name=eth1 [Network] Bond=bond1 PrimarySlave=true ip addr .... 6: bond1: <BROADCAST,MULTICAST,UP,LOWER_UP400> mtu 1500 qdisc noqueue qlen 1000 link/ether b4:05:5d:8f:6a:ad brd ff:ff:ff:ff:ff:ff inet 169.254.11.178/16 brd 169.254.255.255 scope link bond1 valid_lft forever preferred_lft forever inet 192.168.1.108/24 brd 192.168.1.255 scope global bond1 valid_lft forever preferred_lft forever inet 10.2.16.118/24 brd 10.2.16.255 scope global bond1 valid_lft forever preferred_lft forever inet6 fe80::b605:5dff:fe8f:6aad/64 scope link ... Without this patch: After bmc boots: echo eth0 > /sys/class/net/bond1/bonding/active_slave admin@g220a:~# admin@g220a:~# echo eth0 > /sys/class/net/bond1/bonding/active_slave [ 105.964357] bond1: (slave eth0): making interface the new active one admin@g220a:~# ping 10.2.16.1 PING 10.2.16.1 (10.2.16.1): 56 data bytes 64 bytes from 10.2.16.1: seq=0 ttl=255 time=7.096 ms 64 bytes from 10.2.16.1: seq=1 ttl=255 time=2.143 ms 64 bytes from 10.2.16.1: seq=2 ttl=255 time=2.111 ms [ 112.642734] ftgmac100 1e660000.ethernet eth0: NCSI Channel 0 timed out! 64 bytes from 10.2.16.1: seq=3 ttl=255 time=2.039 ms 64 bytes from 10.2.16.1: seq=4 ttl=255 time=2.037 ms [ 117.842814] ftgmac100 1e660000.ethernet eth0: NCSI: No channel with link found, configuring channel 0 [ 134.482746] ftgmac100 1e660000.ethernet eth0: NCSI Channel 0 timed out! [ 139.682820] ftgmac100 1e660000.ethernet eth0: NCSI: No channel with link found, configuring channel 0 with this patch: After bmc boots: admin@g220a:~# echo eth0 > /sys/class/net/bond1/bonding/active_slave [58332.123754] bond1: (slave eth0): making interface the new active one admin@g220a:~# ping 10.2.16.1 PING 10.2.16.1 (10.2.16.1): 56 data bytes 64 bytes from 10.2.16.1: seq=0 ttl=255 time=7.279 ms ... ... 64 bytes from 10.2.16.1: seq=N ttl=255 time=2.037 ms > > > > > > I don't know the ncsi or net code well enough to know if this is the > > > correct fix. If you are confident it is correct then I have no > > > objections. > > > > This looks like it is probably right; pt->dev will be the original > > device from ncsi_register_dev(), if a response comes in to > > ncsi_rcv_rsp() associated with a different device then the driver will > > fail to find the correct ncsi_dev_priv. An example of the broken case > > would be good to see though. > > From the description sounds like the case is whenever the ncsi > interface is in a bond, the netdev from the second argument is > the bond not the interface from which the frame came. It should > be possible to repro even with only one interface on the system, > create a bond or a team and add the ncsi interface to it. > > Does that make sense? I'm likely missing the subtleties here. :) I guess so.