2016-06-24 0:20 GMT+03:00 Jay Vosburgh <jay.vosbu...@canonical.com>: > > Since commit 7bb11dc9f59d ("bonding: unify all places where > actor-oper key needs to be updated."), the logic in bonding to handle > selection between multiple aggregators has not functioned. > > This affects only configurations wherein the bonding slaves > connect to two discrete aggregators (e.g., two independent switches, each > with LACP enabled), thus creating two separate aggregation groups within a > single bond. > > The cause is a change in 7bb11dc9f59d to no longer set > AD_PORT_BEGIN on a port after a link state change, which would cause the > port to be reselected for attachment to an aggregator as if were newly > added to the bond. We cannot restore the prior behavior, as it > contradicts IEEE 802.1AX 5.4.12, which requires ports that "become > inoperable" (lose carrier, setting port_enabled=false as per 802.1AX > 5.4.7) to remain selected (i.e., assigned to the aggregator). As the port > now remains selected, the aggregator selection logic is not invoked. > > A side effect of this change is that aggregators in bonding will > now contain ports that are link down. The aggregator selection logic > does not currently handle this situation correctly, causing incorrect > aggregator selection. > > This patch makes two changes to repair the aggregator selection > logic in bonding to function as documented and within the confines of the > standard: > > First, the aggregator selection and related logic now utilizes the > number of active ports per aggregator, not the number of selected ports > (as some selected ports may be down). The ad_select "bandwidth" and > "count" options only consider ports that are link up. > > Second, on any carrier state change of any slave, the aggregator > selection logic is explicitly called to insure the correct aggregator is > active. > > Reported-by: Veli-Matti Lintu <veli-matti.li...@opinsys.fi> > Fixes: 7bb11dc9f59d ("bonding: unify all places where actor-oper key needs to > be updated.") > Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>
Hi, Thanks for the patch. I have been now testing it and the reselection seems to be working now in most cases, but I hit one case that seems to consistently fail in my test environment. I've been doing most of testing with ad_select=count and this happens with it. I haven't yet done extensive testing with ad_select=stable/bandwidth. The sequence to trigger the failure seems to be: Switch A (Agg ID 2) Switch B (Agg ID 1) enp5s0f0 ens5f0 ens6f0 enp5s0f1 ens5f1 ens6f1 X X - X - - Connection works (Agg ID 2 active) X - - X - - Connection works (Agg ID 1 active) X - - - - - No connection (Agg ID 2 active) I'm also wondering why link down event causes change of aggregator when the active aggregator has the same number of active links than the new aggregator. The situation here clears once a port comes up. Once I hit also problems without disabling all ports on active switch: Switch A (Agg ID 2) Switch B (Agg ID 1) enp5s0f0 ens5f0 ens6f0 enp5s0f1 ens5f1 ens6f1 X X - X - - Connection works (Agg ID 2 active) X - - X - - No connection (Agg ID 1 active) The active switch does not seem to matter either when disabling ports: Switch A (Agg ID 2) Switch B (Agg ID 1) enp5s0f0 ens5f0 ens6f0 enp5s0f1 ens5f1 ens6f1 X - - X X X Connection works (Agg ID 1 active) X - - X - - Connection works (Agg ID 2 active) - - - X - - No connection (Agg ID 1 active) All testing was done with upstream version 4.6.2 with the patch applied. When there is no connection, /proc/net/bonding/bond0 still shows that there is an active aggregator that has a link up, but for some reason no traffic passes through. I added some debugging information in bond_procfs.c and the number of active ports seems to match the enabled ports on switches. I'll continue doing tests with different scenarios and I can also test specific cases if needed. Veli-Matti