On 2017-11-01 8:35 PM, Jay Vosburgh wrote:
Jay Vosburgh <jay.vosbu...@canonical.com> wrote:

Alex Sidorenko <alexandre.sidore...@hpe.com> wrote:

The problem was found while trying to deploy RHEL7 on the HPE Synergy
platform; it is seen both in the customer's environment and in the HPE test lab.

Several bonds are configured in TLB mode with miimon=100; all other
options are default. Slaves are connected to VirtualConnect
modules. Rebooting a VC module should bring one bond slave (ens3f0) down
temporarily, but not the other one (ens3f1). But what we see is:

Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1

        In net-next, I don't see a path in the code that will lead to
this message, as it would apparently require entering
bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
remain in _FAIL state.

The kernel in question is laden with a fair bit of additional debug spew, as we were going back and forth trying to isolate where things were going wrong. That message was indeed from the BOND_LINK_FAIL state in bond_miimon_inspect, but from inside the if (link_state) clause. After the commit++ there's a continue. Doesn't that take us back to the top of the bond_for_each_slave_rcu() loop, so that we bypass the next few lines of code that would have led to a transition to BOND_LINK_DOWN?
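
To make the control flow in question concrete, here is a minimal stand-alone C model of the BOND_LINK_FAIL handling as described above. It is a sketch under stated assumptions, not the kernel source: struct slave_model, inspect_model() and the LINK_* names are hypothetical stand-ins, and the real bond_miimon_inspect does considerably more. It only shows that a continue inside a switch inside a loop jumps to the next loop iteration, so the delay check that would propose DOWN is never reached when the link reads as up.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the bonding link states. */
enum link_model { LINK_UP, LINK_FAIL, LINK_DOWN, LINK_BACK, LINK_NOCHANGE };

/* Hypothetical, heavily reduced model of a bond slave. */
struct slave_model {
    enum link_model link;      /* currently committed state */
    enum link_model proposed;  /* state proposed by this inspection pass */
    int delay;                 /* remaining downdelay ticks */
};

/*
 * Model of one inspection pass over all slaves.
 * carrier[i] != 0 means slave i currently reports link up.
 * Returns the number of proposed state changes ("commit").
 */
int inspect_model(struct slave_model *s, size_t n, const int *carrier)
{
    int commit = 0;

    for (size_t i = 0; i < n; i++) {
        s[i].proposed = LINK_NOCHANGE;

        switch (s[i].link) {
        case LINK_FAIL:
            if (carrier[i]) {
                /* Link reads as up: propose UP and count the change. */
                s[i].proposed = LINK_UP;
                commit++;
                /* Applies to the for loop, not the switch: next slave. */
                continue;
            }
            /* Reached only when the link still reads as down. */
            if (s[i].delay <= 0) {
                s[i].proposed = LINK_DOWN;
                commit++;
                continue;
            }
            s[i].delay--;
            break;
        default:
            break;
        }
    }
    return commit;
}
```

In this model, a slave sitting in LINK_FAIL with delay 0 is proposed LINK_UP when its carrier reads as up, and LINK_DOWN only when it reads as down. Whether the real driver can actually end up with a slave committed to the FAIL state while downdelay is 0 is the open question in this thread; the model does not answer that.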

...
        Your patch does not apply to net-next, so I'm not absolutely
sure where this is, but presuming that this is in the BOND_LINK_FAIL
case of the switch, it looks like both BOND_LINK_FAIL and BOND_LINK_BACK
will have the issue that if the link recovers or fails, respectively,
within the delay window (for down/updelay > 0) it won't set a
slave->new_link.

        Looks like this got lost somewhere along the line, as originally
the transition back to UP (or DOWN) happened immediately.

        I'll have to dig out when that broke, but I'll see about a test
patch this afternoon.

        The case I was concerned with was moved around; the proposed
state is committed in bond_mii_monitor.  But to commit to _FAIL state,
the downdelay would have to be > 0.  I'm not seeing any errors in
net-next; can you reproduce your erroneous behavior on net-next?

I can try to get a net-next-ish kernel into their hands, but the bonding driver we're working with here is quite close to current net-next already, so I'm fairly confident the same thing will happen.

--
Jarod Wilson
ja...@redhat.com
