Aichun Li <liaic...@huawei.com> wrote:

>When the network service is repeatedly restarted in 802.3ad, there is a low
> probability that oops occurs.
>Test commands:systemctl restart network
>
>1.crash: __enable_port():port->slave is NULL
[...]
>     PC: ffff000000e2fcd0  [ad_agg_selection_logic+328]
[...]
>2.I also have another call stack, same as in another person's post:
>https://lore.kernel.org/netdev/52630cba-cc60-a024-8dd0-8319e5245...@huawei.com/

        What hardware platform are you using here?

        moyufeng <moyuf...@huawei.com> appears to be using the same
platform, and I've not had any success so far with the provided script
to reproduce the issue.  I'm using an x86_64 system, however, so I
wonder if perhaps this platform needs a barrier somewhere that x86 does
not, or there's something different in the timing of the device driver
close logic.

>Signed-off-by: Aichun Li <liaic...@huawei.com>
>---
> drivers/net/bonding/bond_3ad.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>index aa001b16765a..9c8894631bdd 100644
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -183,7 +183,7 @@ static inline void __enable_port(struct port *port)
> {
>       struct slave *slave = port->slave;
> 
>-      if ((slave->link == BOND_LINK_UP) && bond_slave_is_up(slave))
>+      if (slave && (slave->link == BOND_LINK_UP) && bond_slave_is_up(slave))
>               bond_set_slave_active_flags(slave, BOND_SLAVE_NOTIFY_LATER);
> }

        This change seems like a band aid to cover the real problem.
The caller of __enable_port is ad_agg_selection_logic, and it shouldn't
be possible for port->slave to be NULL when assigned to an aggregator.

>@@ -1516,6 +1516,7 @@ static void ad_port_selection_logic(struct port *port, 
>bool *update_slave_arr)
>                                 port->actor_port_number,
>                                 port->aggregator->aggregator_identifier);
>               } else {
>+                      port->aggregator = &(SLAVE_AD_INFO(slave)->aggregator);
>                       slave_err(bond->dev, port->slave->dev,
>                                 "Port %d did not find a suitable 
> aggregator\n",
>                                 port->actor_port_number);

        This change isn't correct; it's assigning the port to a more or
less random aggregator.  This would eliminate the panic, but isn't doing
the right thing.  At this point in the code, the selection logic has
failed to find an aggregator that matches, and also failed to find a
free aggregator.

        I do need to fix up the failure handling here when it hits the
"did not find a suitable agg" case; the code here is simply wrong, and
has been wrong since the beginning.  I'll hack the driver to induce this
situation rather than reproducing whatever problem is making it unable
to find a suitable aggregator.

        -J

---
        -Jay Vosburgh, jay.vosbu...@canonical.com

Reply via email to