Felix <fei.f...@linux.alibaba.com> wrote: >Dear Mainteners, > >Recently I hit a packet drop issue in bonding driver on Linux 4.9. Please >see details below. Please take a look to see if my understanding is >correct. Many thanks. > >What is the problem? >The bonding driver starts to send packets even if the Partner(Switch)'s >Collecting bit is not enabled yet. Partner would drop all packets until >its Collecting bit is enabled. > >What is the root cuase? >According to LACP spec, the Actor need to check Partner's Sync and >Collecting bits before enable its Distributing bit and Distributing >function. Please see the PIC below.
The diagram you reference is found in 802.1AX-2014 figure 6-21, which shows the state diagram for an independent control implementation, i.e., collecting and distributing are managed independently. However, Linux bonding implements coupled control, which is shown in figure 6-22. Here, there is no Partner.Collecting requirement on the state transition from ATTACHED to COLLECTING_DISTRIBUTING. To quote 802.1AX-2014 6.4.15: As independent control is not possible, the coupled control state machine does not wait for the Partner to signal that collection has started before enabling both collection and distribution. Now, that said, I agree that what you're seeing is likely explained by this behavior, and your fix should resolve the immediate problem (that bonding sends packets before the peer has enabled COLLECTING). However, your fix does put bonding out of compliance with the standard, as it does not really implement COLLECTING and DISTRIBUTING as discrete states. In particular, if the peer in your case were to later clear Partner.Collecting, bonding will not react to this as a figure 6-21 independent control implementation would (which isn't a change from current behavior, but currently this isn't expected). So, in my opinion a patch like this should have a comment attached noting that we are deliberately not in compliance with the standard in this specific situation. The proper fix is to implement figure 6-21 separate state. Lastly, are you able to test and generate a patch against current upstream, instead of 4.9? -J >How to fix? >Please see the diff as following. And the patch is attached. > >--- ../origin/linux-4.9.188/drivers/net/bonding/bond_3ad.c 2019-08-07 >00:29:42.000000000 +0800 >+++ drivers/net/bonding/bond_3ad.c 2019-08-08 23:13:29.015640197 +0800 >@@ -937,6 +937,7 @@ > */ > if ((port->sm_vars & AD_PORT_SELECTED) && > (port->partner_oper.port_state & AD_STATE_SYNCHRONIZATION) && >+ (port->partner_oper.port_state & AD_STATE_COLLECTING) && > !__check_agg_selection_timer(port)) { > if (port->aggregator->is_active) > port->sm_mux_state = > >------ >Thanks, >Felix --- -Jay Vosburgh, jay.vosbu...@canonical.com