Aichun Li <liaic...@huawei.com> wrote: >When the network service is repeatedly restarted in 802.3ad, there is a low > probability that oops occurs. >Test commands:systemctl restart network > >1.crash: __enable_port():port->slave is NULL [...] > PC: ffff000000e2fcd0 [ad_agg_selection_logic+328] [...] >2.I also have another call stack, same as in another person's post: >https://lore.kernel.org/netdev/52630cba-cc60-a024-8dd0-8319e5245...@huawei.com/
What hardware platform are you using here? moyufeng <moyuf...@huawei.com> appears to be using the same platform, and I've not had any success so far with the provided script to reproduce the issue. I'm using an x86_64 system, however, so I wonder if perhaps this platform needs a barrier somewhere that x86 does not, or there's something different in the timing of the device driver close logic. >Signed-off-by: Aichun Li <liaic...@huawei.com> >--- > drivers/net/bonding/bond_3ad.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c >index aa001b16765a..9c8894631bdd 100644 >--- a/drivers/net/bonding/bond_3ad.c >+++ b/drivers/net/bonding/bond_3ad.c >@@ -183,7 +183,7 @@ static inline void __enable_port(struct port *port) > { > struct slave *slave = port->slave; > >- if ((slave->link == BOND_LINK_UP) && bond_slave_is_up(slave)) >+ if (slave && (slave->link == BOND_LINK_UP) && bond_slave_is_up(slave)) > bond_set_slave_active_flags(slave, BOND_SLAVE_NOTIFY_LATER); > } This change seems like a band aid to cover the real problem. The caller of __enable_port is ad_agg_selection_logic, and it shouldn't be possible for port->slave to be NULL when assigned to an aggregator. >@@ -1516,6 +1516,7 @@ static void ad_port_selection_logic(struct port *port, >bool *update_slave_arr) > port->actor_port_number, > port->aggregator->aggregator_identifier); > } else { >+ port->aggregator = &(SLAVE_AD_INFO(slave)->aggregator); > slave_err(bond->dev, port->slave->dev, > "Port %d did not find a suitable > aggregator\n", > port->actor_port_number); This change isn't correct; it's assigning the port to a more or less random aggregator. This would eliminate the panic, but isn't doing the right thing. At this point in the code, the selection logic has failed to find an aggregator that matches, and also failed to find a free aggregator. I do need to fix up the failure handling here when it hits the "did not find a suitable agg" case; the code here is simply wrong, and has been wrong since the beginning. I'll hack the driver to induce this situation rather than reproducing whatever problem is making it unable to find a suitable aggregator. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com