Alex Sidorenko <alexandre.sidore...@hpe.com> wrote: >On 11/02/2017 12:51 AM, Jay Vosburgh wrote: >> Jarod Wilson <ja...@redhat.com> wrote: >> >>> On 2017-11-01 8:35 PM, Jay Vosburgh wrote: >>>> Jay Vosburgh <jay.vosbu...@canonical.com> wrote: >>>> >>>>> Alex Sidorenko <alexandre.sidore...@hpe.com> wrote: >>>>> >>>>>> The problem has been found while trying to deploy RHEL7 on HPE Synergy >>>>>> platform, it is seen both in customer's environment and in HPE test lab. >>>>>> >>>>>> There are several bonds configured in TLB mode and miimon=100, all other >>>>>> options are default. Slaves are connected to VirtualConnect >>>>>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down >>>>>> temporarily, but not another one (ens3f1). But what we see is >>>>>> >>>>>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms >>>>>> for interface ens3f1 >>>> >>>> In net-next, I don't see a path in the code that will lead to >>>> this message, as it would apparently require entering >>>> bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0. >>>> If downdelay is 0, the code will transition to BOND_LINK_DOWN and not >>>> remain in _FAIL state. >>> >>> The kernel in question is laden with a fair bit of additional debug spew, >>> as we were going back and forth, trying to isolate where things were going >>> wrong. That was indeed from the BOND_LINK_FAIL state in >>> bond_miimon_inspect, inside the if (link_state) clause though, so after >>> commit++, there's a continue, which ... does what now? Doesn't it take us >>> back to the top of the bond_for_each_slave_rcu() loop, so we bypass the >>> next few lines of code that would have led to a transition to >>> BOND_LINK_DOWN? >> >> Just to confirm: your downdelay is 0, correct? > >Correct. > >> >> And, do you get any other log messages other than "link status >> up again after 0 ms"? > >Yes, here are some messages (from an early instrumentation): [...] >That is, we never see ens3f1 going to BOND_LINK_DOWN and it continues >staying in BOND_LINK_NOCHANGE/BOND_LINK_FAIL > > >> >> To answer your question, yes, the "if (link_state) {" block in >> the BOND_LINK_FAIL case of bond_miimon_inspect ends in continue, but >> this path is nominally for the downdelay logic. If downdelay is active >> and the link recovers before the delay expires, the link should never >> have moved to BOND_LINK_DOWN. The commit++ causes bond_miimon_inspect >> to return nonzero, causing in turn the bond_propose_link_state change to >> BOND_LINK_FAIL state to be committed. This path deliberately does not >> set slave->new_link, as downdelay is purposely delaying the transition >> to BOND_LINK_DOWN. >> >> If downdelay is 0, the slave->link should not persist in >> BOND_LINK_FAIL state; it should set new_link = BOND_LINK_DOWN which will >> cause a transition in bond_miimon_commit. The bond_propose_link_state >> call to set BOND_LINK_FAIL in the BOND_LINK_UP case will be committed in >> bond_mii_monitor prior to calling bond_miimon_commit, which will in turn >> do the transition to _DOWN state. In this case, the BOND_LINK_FAIL case >> "if (link_state) {" block should never be entered. > >I totally agree with your description of transition logic, and this is why >we were puzzled by how this can occur until we noticed NetworkManager >messages around this time and decided to run a test without it. >Without NM, everything works as expected. After that, adding more >instrumentation, we have found that we do not propose BOND_LINK_FAIL inside >bond_miimon_inspect() but elsewhere (NetworkManager?).
I think I see the flaw in the logic. 1) bond_miimon_inspect finds link_state = 0, then makes a call to bond_propose_link_state(BOND_LINK_FAIL), setting link_new_state to BOND_LINK_FAIL. _inspect then sets slave->new_link = BOND_LINK_DOWN and returns non-zero. 2) bond_mii_monitor rtnl_trylock fails, it reschedules. 3) bond_mii_monitor runs again, and calls bond_miimon_inspect. 4) the slave's link has recovered, so link_state != 0. slave->link is still BOND_LINK_UP. The slave's link_new_state remains set to BOND_LINK_FAIL, but new_link is reset to NOCHANGE. bond_miimon_inspect returns 0, so nothing is committed. 5) step 4 can repeat indefinitely. 6) eventually, the other slave does something that causes commit++, making bond_mii_monitor call bond_commit_link_state and then bond_miimon_commit. The slave in question from steps 1-4 still has link_new_state as BOND_LINK_FAIL, but new_link is NOCHANGE, so it ends up in BOND_LINK_FAIL state. I think step 6 could also occur concurrently with the initial pass through step 4 to induce the problem. It looks like Mahesh mostly fixed this in commit fb9eb899a6dc663e4a2deed9af2ac28f507d0ffb Author: Mahesh Bandewar <mahe...@google.com> Date: Tue Apr 11 22:36:00 2017 -0700 bonding: handle link transition from FAIL to UP correctly but the window still exists, and requires the slave link state to change between the failed rtnl_trylock and the second pass through _inspect. The problem is that a state transition has been kept between invocations to _inspect, but the condition that induced the transition has changed. I haven't tested these, but I suspect the solution is either to clear link_new_state on entry to the loop in bond_miimon_inspect, or merge new_state and link_new_state as a single "next state" (which is cleared on entry to the loop). The first of these is a pretty simple patch: diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 18b58e1376f1..6f89f9981a6c 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2046,6 +2046,7 @@ static int bond_miimon_inspect(struct bonding *bond) bond_for_each_slave_rcu(bond, slave, iter) { slave->new_link = BOND_LINK_NOCHANGE; + slave->link_new_state = slave->link; link_state = bond_check_dev_link(bond, slave->dev, 0); Alex / Jarod, could you check my logic, and would you be able to test this patch if my analysis appears sound? Thanks, -J --- -Jay Vosburgh, jay.vosbu...@canonical.com