Hello,

I've been testing a LAG implemented with the DPDK eth_bond PMD. As part of my 
fault-tolerance testing, I want to ensure that if a link flaps up and down 
continuously, the impact to service is minimal. What I found is that the LAG 
is rendered inoperable if one particular link (the aggregator's) is flapping. 
Details below.

Setup:

-    4x10G X710 links in an 802.3ad LAG connected to a switch.

-    Under normal operation, the LAG is steady, traffic is balanced, etc.
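
For reference, the bond under test is created along these lines. This is only 
an illustrative sketch: the device name and the slave port IDs 0-3 are 
placeholders, not my actual configuration.

#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_eth_bond.h>

/* Illustrative only: create a mode-4 (802.3ad) bond and add the four X710
 * ports to it. Port IDs 0-3 and the device name are placeholders. */
static int
create_test_lag(void)
{
        uint16_t slaves[] = { 0, 1, 2, 3 };
        unsigned int i;
        int bond_port;

        bond_port = rte_eth_bond_create("net_bonding0", BONDING_MODE_8023AD,
                                        rte_socket_id());
        if (bond_port < 0)
                return bond_port;

        for (i = 0; i < RTE_DIM(slaves); i++)
                if (rte_eth_bond_slave_add(bond_port, slaves[i]) != 0)
                        return -1;

        return bond_port;
}
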
Problem:
If I take down the switch port corresponding to the "aggregator" link in the 
DPDK LAG and then bring it back up, every link in the LAG goes from 
distributing, to not distributing, and back to distributing. This causes an 
unnecessary loss of service.
A single link failure, regardless of whether or not it's the aggregator link, 
should not change the state of the other links. Consider what would happen if 
there were a hardware fault on that link, or its signal were bad: it could be 
stuck flapping up and down indefinitely. That would lead to a complete loss of 
service on the LAG, despite three stable links remaining.
Analysis:
- The switch shows the system ID changing when the link flaps: it goes from 
00:00:00:00:00:00 to the aggregator's MAC. This is not good. Why does it 
happen? Because by default we seem to be using the "AGG_BANDWIDTH" selection 
algorithm, which is broken: it takes a slave index and uses it directly as an 
index into the 802.3ad ports array, which is indexed by DPDK port number. It 
should first translate the slave index into a DPDK port number using the 
slaves[] array (rough sketch after this list).
- Aside from the above, the default is supposed to be AGG_STABLE, according to 
bond_mode_8023ad_conf_get_default. However, bond_mode_8023ad_conf_assign does 
not actually copy the selection algorithm, so it stays 0, which happens to be 
AGG_BANDWIDTH (also sketched below).
- I fixed the above, but still faced two more issues:
  1) The system ID changes when the aggregator changes, which can lead to the 
problem described above.
  2) When the link fails, it is "deactivated" in the LAG via 
bond_mode_8023ad_deactivate_slave. There is a block in there dedicated to the 
case where the aggregator is disabled: it explicitly unselects each slave 
sharing that aggregator. This causes them to fall back to the DETACHED state 
in the mux machine -- i.e. they are no longer aggregating at all until the 
state machine runs through the LACP exchange with the partner again.
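
To make the analysis concrete, here is roughly what I have in mind for the two 
configuration/selection problems above. This is only a sketch against my 
reading of rte_eth_bond_8023ad.c; the identifiers (agg_new_idx in particular, 
and the exact field names) are from memory and may not match the tree exactly.

/* 1) In bond_mode_8023ad_conf_assign(): copy the requested selection
 *    algorithm instead of leaving the internal field at 0 (AGG_BANDWIDTH). */
mode4->agg_selection = conf->agg_selection;

/* 2) In the AGG_BANDWIDTH / AGG_COUNT branches of the selection logic:
 *    the "best" index is an index into the active slaves[] array, not a
 *    DPDK port number, so translate it before using it as the aggregator
 *    port id. */
new_agg_id = agg_new_idx;           /* today: slave index used directly */
new_agg_id = slaves[agg_new_idx];   /* proposed: slave index -> port number */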

Possible fix:
1) Change bond_mode_8023ad_conf_assign to actually copy the selection 
algorithm.
2) Ensure that all members of a LAG have the same system ID, i.e. use the 
LAG's own MAC address (rough sketch below).
3) Do not detach the other members when the aggregator's link state goes down.
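
For (2), something along these lines in the slave activation path is what I'm 
picturing. Again only a sketch: "bond_port_id" and "port" are illustrative 
names, and the exact place to do this is up for discussion.

/* Sketch: when a slave is activated into the 802.3ad LAG, stamp the bonded
 * device's (fixed) MAC address into the slave's actor system ID, rather than
 * ever deriving it from whichever slave is the current aggregator. */
struct ether_addr bond_mac;

rte_eth_macaddr_get(bond_port_id, &bond_mac);
ether_addr_copy(&bond_mac, &port->actor.system);

With every member advertising the same fixed system ID, a change of aggregator 
no longer looks like a new partner system to the switch.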

Note:

1) We should fix AGG_BANDWIDTH and AGG_COUNT separately.

2) I can't see any reason why the system ID should be equal to the MAC of the 
aggregator. It's intended to represent the system to which the LAG belongs, 
not the aggregator itself; the aggregator is represented by the operational 
key (see the struct sketch after this list). So it should be fine to use the 
LAG's MAC address, which is fixed at init, as the system ID for all possible 
aggregators.

3) I think not detaching is the correct approach. There is nothing in my 
reading of 802.1Q or 802.1AX's LACP specification that implies we should do 
this. There is a blurb about parameter changes that lead to a change of 
aggregator forcing the UNSELECTED transition, but I don't think that needs to 
apply here; I'm fairly certain it's talking about changing the operational 
key, etc.
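
For reference, the per-port actor/partner parameters in rte_eth_bond_8023ad.h 
look roughly like this (paraphrased from memory, comments mine). The system 
field identifies the system as a whole; the key is what ties a port to a 
particular aggregator.

struct port_params {
        uint16_t system_priority;  /* actor/partner system priority */
        struct ether_addr system;  /* system ID: a MAC address */
        uint16_t key;              /* operational key - selects the aggregator */
        uint16_t port_priority;    /* per-port priority */
        uint16_t port_number;      /* per-port number */
} __attribute__((packed));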


How does everyone feel about this? Am I crazy in requiring this functionality? 
What about the proposed fix: does it sound reasonable, or am I going to break 
the state machine somehow?

Thanks,

Kyle
