Re: [Lsr] Multiple failures in Dynamic Flooding

Huaimo Chen Mon, 11 Mar 2019 10:08:42 -0700

Hi Tony,

    In summary, two issues with multiple failures in 
draft-li-lsr-dynamic-flooding are under discussion:


1)      how to determine that the current flooding topology is split; and

2)      how to repair/connect the split flooding topology.

For the first issue, the discussion is still ongoing.
For the second issue, repairing/connecting the flooding topology split through 
Hello protocol extensions does not work.  When a “backup path”/connection of 
multiple hops is needed to repair the split, a Hello cannot travel beyond one 
hop, and thus cannot repair the flooding topology split in this case.

>From: Tony Li [mailto:[email protected]] On Behalf Of [email protected]
>Sent: Wednesday, March 6, 2019 10:45 AM
>To: Huaimo Chen <[email protected]>
>Cc: Christian Hopps <[email protected]>; [email protected]; [email protected]; 
>[email protected]
>Subject: Multiple failures in Dynamic Flooding
>
>Hi Huaimo,
>
>>> I’m sorry that you don’t find it useful. Determining the split is trivial: 
>>> when you receive an IIH,
>>> it has the system ID of the other system in it. If that other system is not 
>>> currently part of the
>>> flooding topology, then it is quite clear that it is disconnected from the 
>>> flooding topology.
>>> Repairing the split is done by enabling temporary flooding on the new link.

>>While an adjacency between two nodes is up, the Hello packets exchanged between 
>>them keep carrying the same node/system IDs.
>>How do you determine that the other system is not currently part of the flooding 
>>topology?

>The IIH includes the system ID.  See ISO 10589 v2, section 9.7, field “source 
>Id”.  The local system will have
>a copy of the flooding topology and can easily see if the neighbor was present 
>as of the last FT computation.  If not, then it should be
>added (modulo rate limiting). The local system can also examine its own LSDB. 
>If there is no LSP for the neighbor, then it would seem
>highly likely that there is a disconnect and the neighbor should again be 
>added (modulo rate limiting).
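
The two checks described above can be sketched as follows. This is only an illustration, not text from either draft: the function name, the set `ft_nodes`, and the dictionary `lsdb` keyed by system ID are all assumptions made for the example, and rate limiting is left out.

```python
def disconnect_hints(neighbor_id, ft_nodes, lsdb):
    """Return the reasons a neighbor seen in an IIH looks cut off.

    neighbor_id: system ID taken from the IIH "source Id" field.
    ft_nodes:    set of system IDs in the last computed flooding topology
                 (hypothetical representation).
    lsdb:        dict mapping system ID -> LSP data (hypothetical; a real
                 LSDB is keyed by LSP ID).
    """
    hints = []
    if neighbor_id not in ft_nodes:
        # Neighbor was absent as of the last FT computation.
        hints.append("not in flooding topology")
    if neighbor_id not in lsdb:
        # No LSP from the neighbor has been received at all.
        hints.append("no LSP in LSDB")
    return hints
```

An empty result means neither check flagged the neighbor; it does not prove that the flooding topology is connected.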

>We are not requiring it, but a system could also do a more extensive 
>computation and compare the links between itself and the neighbor
>by tracing the path in the FT and then confirming that each link is up in the 
>LSDB.
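
A minimal sketch of that more extensive check, under stated assumptions: links are given as pairs of system IDs, `ft_links` holds the flooding-topology links, and `up_links` holds the links the LSDB currently reports as up. Both names and the representation are hypothetical. Instead of tracing one particular path, the sketch asks whether any path exists that uses only FT links which are up.

```python
from collections import deque


def reachable_over_ft(local, neighbor, ft_links, up_links):
    """True if neighbor is reachable from local over FT links that are up."""
    # Keep only links that are both on the flooding topology and up.
    # frozenset makes (A, B) and (B, A) compare equal; self-loops are
    # assumed not to occur.
    usable = {frozenset(l) for l in ft_links} & {frozenset(l) for l in up_links}
    adj = {}
    for link in usable:
        a, b = tuple(link)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    # Plain breadth-first search from the local system.
    seen = {local}
    queue = deque([local])
    while queue:
        node = queue.popleft()
        if node == neighbor:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Note that the result is only as fresh as the LSDB, which is the concern raised in the reply below.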

It normally takes a long time (for example, more than ten minutes) to age out 
and remove an LSP/LSA for the neighbor from the LSDB, even when the neighbor is 
physically disconnected.
How can you decide quickly, within tens of milliseconds, that the flooding 
topology is disconnected?

>>> There is an issue here that we have not yet resolved, which is the rate 
>>> that new links should be
>>> temporarily added to the flooding topology.  Some believe that adding any 
>>> new link is the
>>> correct thing to do as it minimizes the recovery time. Others feel that 
>>> enabling too many links
>>> could cause a flooding collapse, so link addition should be highly 
>>> constrained. We are still
>>> discussing this and invite the WG’s opinions.

>>The issue is resolved by the solutions in draft-cc-lsr-flooding-reduction.
>>One solution is below, where the given distance can be adjusted/configured.
>>If we want every node to flood on all its links, we set the given
>>distance to a large number. If we want the nodes within 2 hops of a failure
>>to flood on all their links, we set the given distance to 2.
>>   “In one way, when two or more failures on the current flooding
>>   topology occur almost in the same time, each of the nodes within a
>>   given distance (such as 3 hops) to a failure point, floods the link
>>   state (LS) that it receives to all the links (except for the one from
>>   which the LS is received) until a new flooding topology is built.”
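
To make the "given distance" concrete, here is a rough sketch of how the set of nodes that fall back to full flooding could be computed. It assumes that each failure point is identified by a node ID and that `adj` maps every node to its neighbors; the function name and both representations are assumptions for illustration, not definitions from draft-cc-lsr-flooding-reduction.

```python
from collections import deque


def nodes_within_distance(adj, failure_points, max_hops):
    """Return the nodes at most max_hops away from any failure point."""
    # Multi-source BFS: every failure point starts at distance 0.
    dist = {p: 0 for p in failure_points}
    queue = deque(failure_points)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the configured distance
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)
```

On a chain A-B-C-D-E with a failure point at A and a distance of 2, the result is {A, B, C}; a very large distance returns every node reachable from a failure point.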


>As we have discussed, this is not a solution. In fact, this is more dangerous 
>than anything else that has been proposed and
>seems highly likely to trigger a cascade failure. You are enabling full 
>flooding for many nodes.  In dense topologies, even
>a radius of 3 is very high.  For example, in a LS topology, a radius of 3 is 
>sufficient to enable full flooding throughout the
>entire topology. If that were stable, we would not need Dynamic Flooding at 
>all.
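
The radius argument can be checked with a small sketch, reading "LS topology" as a two-tier leaf-spine fabric in which every leaf connects to every spine. That reading, the fabric size, and all names below are assumptions made for the example.

```python
from collections import deque


def build_leaf_spine(num_spines, num_leaves):
    """Adjacency map of a two-tier fabric: each leaf links to each spine."""
    spines = [f"S{i}" for i in range(num_spines)]
    leaves = [f"L{i}" for i in range(num_leaves)]
    adj = {s: set(leaves) for s in spines}
    adj.update({l: set(spines) for l in leaves})
    return adj


def hop_counts(adj, source):
    """Breadth-first hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


fabric = build_leaf_spine(num_spines=4, num_leaves=16)
dist = hop_counts(fabric, "L0")
farthest = max(dist.values())   # largest hop count from leaf L0
covered = len(dist)             # nodes reachable from leaf L0
```

In this model every node lies within two hops of any leaf, so a radius of 3 around a single failure point already covers the whole fabric.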

This full flooding is enabled only for a very short time.
Why do you consider this more dangerous than anything else and highly 
likely to trigger a cascade failure? Can you explain in detail?

>>Another solution is just adding minimum links temporarily on the flooding
>>topology to repair the split flooding topology until a new flooding topology
>>is built.

>Agreed.  Which links constitute the minimum?  In a general topology, with 
>arbitrary failures that are not distributed globally,
>how do we make a distributed decision about which links to enable? This is the 
>problem that we are trying to solve. And
>we have no oracle to tell us The Right Answer.

We can discuss this after the first method is discussed.

Best Regards,
Huaimo

>Regards,
>Tony

_______________________________________________
Lsr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/lsr
