Reply to comments on the IETF 115 topic "Requirement & Framework of Fast Fault Detection for IP-based Networks"

Wanghaibo (Rainsword) Thu, 08 Dec 2022 06:09:22 -0800

Dear All,

Here are the minutes of, and our replies to, the comments on the topic
"Requirement & Framework of Fast Fault Detection for IP-based Networks".
Further discussion is welcome.
Related drafts:
https://datatracker.ietf.org/doc/draft-guo-ffd-requirement/
https://datatracker.ietf.org/doc/draft-wang-ffd-framework/


Comments 1:
Jeff: As a participant, I want to talk about machine learning clusters.
The goal is to converge with a number of entities, not a number of seconds.
The infrastructure is parallel, without a single point of failure; the goal
is to detect a failure ASAP and route around it in the IP network. This is
commonly implemented on hosts today, with FlowBender or a variety of other
techniques. If you need to notify a controller, you're in seconds and
your machine learning job is dead. The requirement is not suitable for
machine learning clusters.
//Answer:
Computing nodes can quickly detect faults among themselves by using a
communication framework. However, for a high-performance computing task, once a
fault occurs, the entire task needs to be rescheduled. The task scheduler
usually accesses and schedules tasks through the management network, so it
cannot detect the fault. Reassigned tasks may therefore be assigned again to
nodes with network failures, which causes the tasks to keep failing.
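
As a rough illustration of the rescheduling problem above, the following Python
sketch shows a toy scheduler that skips nodes reported as having network
faults. The class, its methods, and the idea that fault reports reach the
scheduler are assumptions made for illustration only; none of these names come
from the drafts.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Scheduler:
    """Toy task scheduler; every name here is hypothetical."""

    nodes: list[str]
    # Nodes reported as having data-plane network faults. We assume some
    # fast fault notification delivers these reports to the scheduler.
    faulty: set[str] = field(default_factory=set)

    def report_fault(self, node: str) -> None:
        """Record that a node currently has a network fault."""
        self.faulty.add(node)

    def place(self, task_size: int) -> list[str]:
        """Pick task_size nodes, skipping nodes with known network faults."""
        healthy = [n for n in self.nodes if n not in self.faulty]
        if len(healthy) < task_size:
            raise RuntimeError("not enough healthy nodes")
        return healthy[:task_size]
```

Without the fault report, place() would keep returning the faulty node, which
is the repeated-failure case described above.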

Comments 2:
David Black: I'm one of the original designers of NVMe over Fabrics and
the NVMe over TCP transport. I'm surprised; the storage network
configuration shown here is unrealistic. Active-passive configurations
have typically moved to active-active these days, which means the second
path is active. If there is a failure on the first path, there is an
opportunity to immediately use the second path to get the failure
information communicated without having to go through all the switches.
That's better for NVMe because it's not relying on switch interactions. It
seems to me the failure detection is being built on IP accessibility. That's
not a good idea, as routing is the authority for the topology and for which
IP addresses are reachable. Please don't reinvent that.
//Answer:
The storage system has an active-active solution. Theoretically, the second
active path can be used to convey fault information and drive the switchover
of the source node. However, this can only solve local link faults of the
storage node; it cannot solve the problem of network faults that have not yet
converged.
In storage application deployment scenarios, independent dual-plane networking
may be used. In this deployment, a device in a single plane may be faulty. In
this case, the network cannot fully converge.

Comments 3:
The draft labels the security considerations as NA (not applicable), which
might itself be not acceptable. It's a great vector for DoS attacks. When a
switch side detects a link failure, it should turn the link off so that the
other end notices it pretty quickly, and you don't have a problem with the two
ends disagreeing on link failure.
//Answer:
This is described in the implementation solution, but not in the Security
section. For security purposes, it is recommended that access devices use Layer
2 protocols for registration and advertisement. In this way, information is
advertised within only one broadcast domain. In addition, switches can use
policies to control the generation of information and to prevent it from being
forwarded or flooded in the access domain. This ensures the security of the
entire solution.

Comments 4:
Sasha: (on the slide using NVMe as an example) In this configuration, if
sw1 and sw2 were connected and the hosts were using loopback addresses,
then when a failure happens the switches would reroute and the hosts would
remain ignorant of the failure. That's what network operators would prefer.
If the switches are not connected, personally I see it as a poor network
design, and we shouldn't propagate new functions to hosts.
//Answer:
The solutions shown here are related to network deployment. Currently,
independent dual-plane networking may be used. That is, a network is divided
into two independent planes that are disjoint.
When one plane is faulty, service access can continue through the other
plane. This is a key scenario that we need to address.
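
A minimal Python sketch of the dual-plane behavior described above, assuming
the host somehow learns the up/down state of each plane. The function name,
the plane labels, and the plane_up mapping are hypothetical and are not taken
from the drafts.

```python
from __future__ import annotations


def select_plane(plane_up: dict[str, bool], preferred: str = "plane-A") -> str:
    """Return the plane to use for service access.

    plane_up maps a plane name to True if that plane is believed healthy.
    How this state is learned is outside the scope of this sketch.
    """
    # Stay on the preferred plane while it is healthy.
    if plane_up.get(preferred, False):
        return preferred
    # Otherwise fall back to the first healthy plane, in insertion order.
    for name, up in plane_up.items():
        if up:
            return name
    raise RuntimeError("no healthy plane available")
```

Because the two planes are disjoint, the switchover can only be as fast as the
host's knowledge of the fault in the failed plane.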

Comments 5:
Greg: That's already done in the IGP.
Tony Li: The similarities to the UPA work in LSR are not small.
//Answer:
Whether to extend the IGP or build on UPA is an implementation issue. The
information that needs to be transmitted is not limited to IP reachability.
Therefore, a new protocol is recommended.

Kind regards,
Haibo

_______________________________________________
rtgwg mailing list
rtgwg@ietf.org
https://www.ietf.org/mailman/listinfo/rtgwg
