Hi Huaimo,
> In summary, for multiple failures, two issues in
> draft-li-lsr-dynamic-flooding are discussed:
> 1) how to determine that the current flooding topology is split; and
> 2) how to repair/connect the flooding topology split.
> For the first issue, the discussions are still going on.
> For the second issue, repairing/connecting the flooding topology split
> through Hello protocol extensions does not work. When a "backup
> path"/connection of multiple hops is needed to connect/repair the
> flooding topology split, Hello cannot go beyond one hop, and thus
> cannot repair the flooding topology split in this case.

You do not try to repair things remotely; they are always repaired
locally. If there are multiple failures in the flooding topology and it
is partitioned, then it follows that there are multiple remaining
connected components of the flooding topology. Nodes that are adjacent
to the failures will update their LSPs and flood them throughout their
connected component. If the FT has partitioned, each component will see
at least two link failures, so each node in the component can detect
that the FT has partitioned. Each node is then capable of enabling
temporary flooding on one or more links that will traverse the
partition, thereby restoring a functioning FT. The Area Leader then
recomputes and redistributes the revised FT.

To put it yet another way: repair is fully distributed. You should like
that. :-)

> > We are not requiring it, but a system could also do a more extensive
> > computation and compare the links between itself and the neighbor
> > by tracing the path in the FT and then confirming that each link is
> > up in the LSDB.
>
> It normally takes a long time, such as more than ten minutes, to age
> out and remove an LSP/LSA for the neighbor from the LSDB, even though
> the neighbor is disconnected physically.
> How can you decide quickly, in tens of milliseconds, that the flooding
> topology is disconnected?

You do not wait for LSP/LSA removal.
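To make the local-detection step concrete, here is a toy sketch (my own
code, not the draft's algorithm, with names of my own invention): a node
removes the links it has learned are down from its copy of the flooding
topology and recomputes connected components. More than one component
means the FT has partitioned, and no LSP aging is involved.

```python
# Illustrative only: FT partition detection via connected components.
# The graph model and names here are assumptions for the example.

def components(nodes, links):
    """Return the connected components of an undirected graph."""
    adj = {n: set() for n in nodes}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Flooding topology: a ring A-B-C-D-A.
nodes = {"A", "B", "C", "D"}
ft = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")}

# Two link failures split the ring into {A, B} and {C, D}.
failed = {("B", "C"), ("D", "A")}
comps = components(nodes, ft - failed)
print(len(comps) > 1)  # True: the remaining FT is partitioned
```

A node in either component sees both failures in received LSPs (or as a
local link-down event), so it can reach this conclusion without waiting
for anything to age out of the LSDB.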
You look for link changes in the LSPs that you do get, or for local link
changes.

> > As we have discussed, this is not a solution. In fact, this is more
> > dangerous than anything else that has been proposed and seems highly
> > likely to trigger a cascade failure. You are enabling full flooding
> > for many nodes. In dense topologies, even a radius of 3 is very
> > high. For example, in a LS topology, a radius of 3 is sufficient to
> > enable full flooding throughout the entire topology. If that were
> > stable, we would not need Dynamic Flooding at all.
>
> This full flooding is enabled only for a very short time.

All it takes is enabling it at sufficient density to create a cascade
failure. Milliseconds are sufficient for a collapse.

> How do you get that this is more dangerous than anything else and
> seems highly likely to trigger a cascade failure? Can you give some
> explanations in detail?

Again, we do not have absolute metrics on what triggers a cascade
failure today. We have several data points from several different
implementations at different points in time. We know that in the early
'90s, a full mesh of 20 neighbors running L1L2 was sufficient. Obviously
things have changed somewhat, but even more modern implementations have
had problems. This is why the MSDC went to BGP.

As a result, we need to be very conservative about what flooding we
temporarily enable. We do not want to walk anywhere near the cliff, as a
cascade failure is fatal to the network.

Tony
_______________________________________________
Lsr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/lsr
