From: Tuong Lien <tuong.t.l...@dektech.com.au>
Date: Mon, 17 Jun 2019 11:56:12 +0700
> It appears that a FAILOVER_MSG can come from peer even when the failure
> link is resetting (i.e. just after the 'node_write_unlock()'...). This
> means the failover procedure on the node has not been started yet.
> The situation is as follows:
 ...
> Once this happens, the link failover procedure will be triggered
> wrongly on the receiving node since the node isn't in FAILINGOVER state
> but then another link failover will be carried out.
> The consequences are:
>
> 1) A peer might get stuck in FAILINGOVER state because the 'sync_point'
> was set, reset and set incorrectly, the criteria to end the failover
> would not be met, it could keep waiting for a message that has already
> been received.
>
> 2) The early FAILOVER_MSG(s) could be queued in the link failover
> deferdq but would be purged or not pulled out because the 'drop_point'
> was not set correctly.
>
> 3) The early FAILOVER_MSG(s) could be dropped too.
>
> 4) The dummy FAILOVER_MSG could make the peer leave FAILINGOVER state
> shortly, but later on it would be restarted.
>
> The same situation can also happen when the link is in PEER_RESET state
> and a FAILOVER_MSG arrives.
>
> The commit resolves the issues by forcing the link down immediately, so
> the failover procedure will be started normally (which is the same as
> when receiving a FAILOVER_MSG and the link is in up state).
>
> Also, the function "tipc_node_link_failover()" is toughened to avoid such
> a situation from happening.
>
> Acked-by: Jon Maloy <jon.ma...@ericsson.se>
> Signed-off-by: Tuong Lien <tuong.t.l...@dektech.com.au>

Applied, thank you.
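
For readers following the thread, here is a rough toy model of the behaviour the commit message describes. It is only a sketch and is not the kernel code: every name in it (toy_node, toy_link_down, toy_rcv_failover_msg, the TOY_LINK_* states) is hypothetical, and it assumes a single failed link per node. It shows two ideas: a FAILOVER_MSG that arrives while the link is up, resetting or in peer-reset state forces the link down first, and the failover start itself is guarded so it cannot be triggered twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the kernel's TIPC structures. */
enum toy_link_state {
	TOY_LINK_UP,
	TOY_LINK_RESETTING,
	TOY_LINK_PEER_RESET,
	TOY_LINK_DOWN
};

struct toy_node {
	enum toy_link_state failed_link; /* state of the link being failed over */
	bool failingover;                /* node is in its failover phase */
	int failover_starts;             /* how many times failover was started */
};

/* Take the link down and start failover, but never start it twice. */
static void toy_link_down(struct toy_node *n)
{
	assert(n != NULL);
	n->failed_link = TOY_LINK_DOWN;
	if (!n->failingover) {
		n->failingover = true;
		n->failover_starts++;
	}
}

/* Handle an incoming FAILOVER_MSG in the toy model. */
static void toy_rcv_failover_msg(struct toy_node *n)
{
	assert(n != NULL);
	switch (n->failed_link) {
	case TOY_LINK_UP:
	case TOY_LINK_RESETTING:
	case TOY_LINK_PEER_RESET:
		/* Force the link down so failover starts the normal way. */
		toy_link_down(n);
		break;
	case TOY_LINK_DOWN:
		/* Failover already under way; nothing to restart. */
		break;
	}
	/* Processing of the tunnelled message itself would follow here. */
}
```

In this model, a node whose link is in TOY_LINK_RESETTING ends up with the link down and failover started exactly once, no matter how many early FAILOVER_MSGs arrive afterwards.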