Dear All,

Here are the minutes of, and our replies to, the discussion on "Requirement & Framework of Fast Fault Detection for IP-based Networks". Further discussion is welcome.

Related drafts:
https://datatracker.ietf.org/doc/draft-guo-ffd-requirement/
https://datatracker.ietf.org/doc/draft-wang-ffd-framework/

Comment 1:
Jeff: As a participant, I want to talk about machine learning clusters. The goal is to converge with a number of entities, not in a number of seconds. The infrastructure is parallel, without a single point of failure; the goal is to detect a failure as soon as possible and route around it in the IP network. This is commonly implemented on hosts today, with FlowBender or a variety of other techniques. If you need to notify a controller, you are already in the range of seconds and your machine learning job is dead. The requirement is not suitable for machine learning clusters.

//Answer: Computing nodes can quickly detect faults among themselves by using a communication framework. However, for a high-performance computing task, once a fault occurs the entire task needs to be rescheduled. The task scheduler usually accesses and schedules tasks through the management network, so it cannot detect the fault itself. As a result, rescheduled tasks may again be assigned to nodes with network failures, and the tasks continue to fail.

Comment 2:
David Black: I'm one of the original designers of NVMe over Fabrics and the NVMe over TCP transport. I'm surprised: the storage network configuration shown here is unrealistic. Active-passive configurations have typically moved to active-active these days, which means the second path is active. If there is a failure on the first path, there is an opportunity to immediately use the second path to communicate the failure information without having to go through all the switches. That's better for NVMe because it does not rely on switch interactions. It seems to me the failure detection is being built on IP reachability. That is not a good idea, because routing is the authority for the topology and for which IP addresses are reachable. Please don't reinvent that.

//Answer: Storage systems do have an active-active solution. In theory, the second active path can be used to convey fault information and trigger the switchover at the source node.
However, this only handles local link faults of the storage node; it does not handle faults for which the network has not converged. Storage deployments may use independent dual-plane networking. In such a deployment, a device in a single plane may fail, and in that case network convergence cannot be completed.

Comment 3:
The draft marks the Security Considerations as NA, "not applicable", which might also be read as "not acceptable". This is a great vector for DoS attacks. When a switch detects a link failure it should turn the link off, so that the other end notices quickly and you don't have a problem with the two ends disagreeing about the link failure.

//Answer: This is described in the implementation solution, but not in the Security section. For security purposes, it is recommended that access devices use Layer 2 protocols for registration and advertisement, so that information is advertised within a single broadcast domain only. In addition, switches can use policies to control information generation and to prevent the information from being forwarded or flooded in the access domain. This ensures the security of the entire solution.

Comment 4:
Sasha: (On the slide using NVMe as an example) In this configuration, if sw1 and sw2 were connected and the hosts were using loopback addresses, then when a failure happens the switches would reroute and the hosts would remain ignorant of the failure. That is what network operators would prefer. If the switches are not connected, I personally see it as a poor network design, and we shouldn't propagate new functions to the hosts.

//Answer: The solutions shown here depend on the network deployment. Currently, independent dual-plane networking may be used; that is, a network is divided into two independent, disjoint planes. When one plane is faulty, service access can continue through the other plane. This is a key scenario we need to address.

Comment 5:
Greg: That's already done in IGP.
Tony Li: The similarities to the UPA work in LSR are not small.

//Answer: Whether to extend the IGP or to build on UPA is an implementation issue. The information that needs to be transmitted is not limited to IP reachability. Therefore, a new protocol is recommended for implementation.

Kind regards,
Haibo
_______________________________________________
rtgwg mailing list
rtgwg@ietf.org
https://www.ietf.org/mailman/listinfo/rtgwg