Re: [Lsr] Min Links for Multiple Failures on Flooding Topology

Huaimo Chen Fri, 19 Apr 2019 08:25:43 -0700

Hi Tony,

    My explanations are inline below with prefix [HC].

From: Tony Li [mailto:[email protected]] On Behalf Of [email protected]
Sent: Monday, April 15, 2019 11:37 AM
To: Huaimo Chen <[email protected]>
Cc: [email protected]
Subject: Re: [Lsr] Min Links for Multiple Failures on Flooding Topology

Hi Huaimo,

Consider the case that the multiple failures occur at almost the same time and 
there is no other failure following these multiple failures shortly. In this 
case, the backup paths will connect the live end nodes of the failures on the 
FT, and the FT partition is fixed. Note that we assume that the real network 
topology is not split. So there will be working backup paths that connect the 
FT partitions.

I don’t think you can just assume that the network topology is not damaged in 
some way.  After all, the FT partitioned because of real failures.  Your 
algorithm for computing backup paths relies on old information about the 
topology of other partitions.  That information may be out of date and 
incorrect. This can lead to the paths not being viable.

[HC]: In general, using the backup paths for the failures to repair the FT 
partition works well to some extend even though the nodes in one partition have 
some information about the topology that the nodes in another partition do not 
have.

At first, it can survive two or more failures. For example, it can repair the 
FT partition created by the failures of any two links on the FT. It can also 
repair the FT partition caused by the failures of any two links on the FT and 
the failure of any other link if the algorithm for computing backup paths is 
enhanced to consider using diverse links (i.e., the backup paths are computed 
in such a way that one backup path does not share any link with the other 
backup paths if possible).

Secondly, it helps the convergence. The backup paths for a failure are created 
by the nodes close (or say local) to the failure and connect the two 
partitions. After this local repair of the FT partition, the link states in the 
two partitions are synchronized, and the topology is converged. Locally 
repairing the FT partition is faster than the repairing by the area leader.

Moreover, if one extra backup path for a failure is computed and enabled for 
temporary flooding, the FT partition created by one more failure can be 
repaired.

The network topology is damaged.  This is considered in the algorithm for 
computing backup paths.  Every node uses its LSDB for computing backup paths if 
needed. The FT is portioned, but the real network topology is not, and is used. 
For the information about a failure that is on one partition and not on another 
partition, if the part damaged by this failure is not used in any other backup 
paths, then there is no problem; otherwise (i.e., it is used in some backup 
paths), if the damaged part is used only in one backup path, the backup paths 
can survive the damaged part if that one backup path does not share any link 
with the other backup paths.

For the case that the multiple failures occur at almost the same time and then 
other failures happen following the (old) multiple failures shortly, we can 
consider all these failures (i.e., the old multiple failures and the other 
failures including destination node failure) as new multiple failures. For 
these new multiple failures, the backup paths for all these failures will 
connect the live end nodes of the failures on the FT, and the FT partition is 
fixed.

Again, how can you consider any new failures in the computation of the backup 
paths? You have no way of getting information since the FT partitioned.

[HC]: The information about the new failures will flood through the backup 
paths for the old failures if the backup paths are created. Refer to the 
explanation above.

It’s true that just adding one link may not heal the partition.  This is 
exactly why we feel we need to iterate.  Each iteration increases the size of 
the partition (or decreases the cut set, depending on how you want to look at 
it) and because the graph is finite, we are guaranteed to converge.

The convergence in this way may take a long time.

Agreed.  However, the alternatives that we have on the table right now are (a) 
take several iterations, or (b) turn on many links and risk a cascade failure.

Each iteration takes a time Ti. If we need n (n > 1) iterations, then the total 
time for convergence is n times Ti. For centralized flooding reduction, Ti is 
the time including three parts: 1) for the link states (i.e., LSPs or LSAs) 
representing some of the multiple failures to travel to the area leader in the 
network over the FT that is partitioned, 2) the leader to compute and send a 
new FT, and 3) the other nodes  (i.e., non area leader nodes) receive and build 
the new FT.

We agree that it is linear in the time taken to add new nodes to the topology.  
However, due to rate limiting, I suspect that the overall time is more 
constrained by the rate limiting.

After a multiple failure, all nodes can evaluate the state of their adjacencies 
against the current, existing, and now partitioned FT.  Information about the 
partition has been flooded using the current FT, but only within the 
partitions.  Any node with an adjacency outside of its partition can enable 
temporary flooding to one neighbor and begin rate limiting.

While rate limiting is in progress, a new area leader election and FT 
computation can happen within the partition.  This may or may not yield LSDB 
updates.  Similarly, the temporary addition to the flooding topology may or may 
not yield LSDB updates.

Whether the area leader election and FT computation is faster than the rate 
limiting is unknown, since both are implementation dependent.

[HC]: It seems that using the backup paths for the failures to repair the FT 
partition may make the convergence faster. Refer to the second point in the 
first explanation above.
It may be better to use both the backup paths for the failures and the rate 
limiting. The former repairs the partitions locally, and the latter helps the 
former to work better. Thus the convergence will be faster.

Best Regards,
Huaimo

Tony

_______________________________________________
Lsr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/lsr

Re: [Lsr] Min Links for Multiple Failures on Flooding Topology

Reply via email to