Hi Huaimo,

> I don’t think you can just assume that the network topology is not damaged in 
> some way.  After all, the FT partitioned because of real failures.  Your 
> algorithm for computing backup paths relies on old information about the 
> topology of other partitions.  That information may be out of date and 
> incorrect. This can lead to the paths not being viable.
>  
> [HC]: In general, using the backup paths for the failures to repair the FT 
> partition works well to some extent even though the nodes in one partition 
> have some information about the topology that the nodes in another partition 
> do not have. 
>  
> First, it can survive two or more failures. For example, it can repair the 
> FT partition created by the failures of any two links on the FT. It can also 
> repair the FT partition caused by the failures of any two links on the FT and 
> the failure of any other link if the algorithm for computing backup paths is 
> enhanced to consider using diverse links (i.e., the backup paths are computed 
> in such a way that one backup path does not share any link with the other 
> backup paths if possible). 
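As I understand it, the "diverse links" enhancement you're describing amounts to the usual greedy trick: run SPF, strip the links of the path just found, and run SPF again. A minimal sketch of that reading (my own names and simplifications, not text from any draft):

```python
import heapq

def shortest_path(adj, src, dst):
    """Dijkstra over an undirected graph given as {node: {neighbor: cost}}.
    Every node is assumed to appear as a key in adj."""
    dist = {src: 0}
    prev = {}
    pq = [(0, src)]
    seen = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in seen:
            continue
        seen.add(u)
        if u == dst:
            break
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return list(reversed(path))

def disjoint_backup_paths(adj, src, dst, count=2):
    """Greedily compute up to `count` pairwise link-disjoint paths:
    after each path is found, its links are removed from the working
    copy of the graph before the next SPF runs."""
    g = {u: dict(nbrs) for u, nbrs in adj.items()}
    paths = []
    for _ in range(count):
        p = shortest_path(g, src, dst)
        if p is None:
            break
        paths.append(p)
        for a, b in zip(p, p[1:]):
            g[a].pop(b, None)
            g[b].pop(a, None)
    return paths
```

On a square topology A-B-D / A-C-D, this yields the two link-disjoint paths through B and through C. Note it computes them from one node's LSDB only, which is exactly the problem below.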


Well, I’m still not seeing how this works.  AFAICT, when there is a partition, 
your proposal is to compute a shortest path between the previously connected 
nodes using link state information that is now out of date.  What you compute 
is a multi-link path, though it’s not at all clear how this is enabled.  Since 
the FT is down, there is no way to signal to the other half of the partition 
about which links are requested.  

At best, it seems like you’re having one partition-edge router enable one link 
towards the other partition.


> Secondly, it helps the convergence. The backup paths for a failure are 
> created by the nodes close (or say local) to the failure and connect the two 
> partitions. After this local repair of the FT partition, the link states in 
> the two partitions are synchronized, and the topology is converged. Locally 
> repairing the FT partition is faster than repair by the area leader.


The area leader cannot repair the partition, since the FT cannot be propagated 
across the partition.  This is why enabling links across the partition edge is 
necessary.

It is not clear that your backup path computation (an SPF?) is any faster (or 
slower) than determining whether a neighbor is in a separate partition.
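For reference, determining that a neighbor is in a separate partition is just a reachability check (one BFS) over the locally known FT, asymptotically no more work than a single SPF. A sketch, with hypothetical names:

```python
from collections import deque

def same_partition(ft_adj, a, b):
    """BFS over the locally known flooding topology `ft_adj`
    ({node: set of FT neighbors}); returns True iff `b` is
    reachable from `a` over FT links."""
    seen = {a}
    q = deque([a])
    while q:
        u = q.popleft()
        if u == b:
            return True
        for v in ft_adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                q.append(v)
    return False
```

A BFS visits each FT node and link once (O(V + E)), while an SPF is O(E log V) on the full topology, so the partition check is, if anything, the cheaper of the two.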


> The network topology is damaged.  This is considered in the algorithm for 
> computing backup paths.


Only partially.  Your algorithm does not know what the damage is in the other 
half of the partition.  There is no way to get it this information. As a 
result, the backup path that’s computed may traverse nodes and links in the 
other half of the partition that are no longer functional. It may also 
completely ignore links and nodes that could heal the partition.


> Every node uses its LSDB for computing backup paths if needed. The FT is 
> partitioned, but the real network topology is not, and it is what is used. 
> Consider a failure that is known in one partition but not in the other. If 
> the part damaged by this failure is not used in any of the backup paths, 
> there is no problem. If the damaged part is used in exactly one backup path, 
> the backup paths can still survive it, provided that that one backup path 
> does not share any link with the other backup paths. 
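Just to pin that condition down, here is my reading of it as a small check (names are mine, not from any draft):

```python
def links_of(path):
    """Undirected links of a node path, normalized so (a, b) == (b, a)."""
    return {frozenset(pair) for pair in zip(path, path[1:])}

def survives_damage(backup_paths, damaged_link):
    """The stated condition, as I read it: the set of backup paths
    survives a damaged link iff the link lies on at most one path,
    and that path shares no link with any of the others."""
    dl = frozenset(damaged_link)
    hit = [p for p in backup_paths if dl in links_of(p)]
    if not hit:
        return True   # damaged link unused by any backup path
    if len(hit) > 1:
        return False  # link shared by several backup paths
    victim = links_of(hit[0])
    others = [links_of(p) for p in backup_paths if p is not hit[0]]
    return all(victim.isdisjoint(o) for o in others)
```

Even when this predicate holds, it is evaluated against the stale LSDB; nothing guarantees the surviving paths are actually up in the other partition.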


It seems to me that this leads to enabling flooding throughout the entire 
topology, leading us back to cascade failure.


> Again, how can you consider any new failures in the computation of the backup 
> paths? You have no way of getting information since the FT partitioned.
>  
> [HC]: The information about the new failures will flood through the backup 
> paths for the old failures if the backup paths are created. Refer to the 
> explanation above.


This reasoning is circular.  You have still not shown that the backup paths 
will repair the failures.


> [HC]: It seems that using the backup paths for the failures to repair the FT 
> partition may make the convergence faster. Refer to the second point in the 
> first explanation above.
> It may be better to use both the backup paths for the failures and the rate 
> limiting. The former repairs the partitions locally, and the latter helps the 
> former to work better. Thus the convergence will be faster.


You have not shown that the backup paths will provide any benefit at all, much 
less improve convergence.  First and foremost, we need an algorithm that will 
lead to partition repair.  

We still need you to provide a specific, detailed proposal that we all agree 
will heal the partition.

Instead of continuing to argue about the merits of your proposal at a high 
level, I suggest we return to specific examples.  I think you’ll find, as we 
work through them, that the greedy algorithm described in the draft is very 
difficult to improve upon in the general case.

Regards,
Tony


_______________________________________________
Lsr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/lsr
