Hi Les,

All the MTU issues we have seen were on Telco WAN circuits. They were not planned events, and there were no alarms on our side. Had there been alarms, troubleshooting would have been much easier, as we would have known where to focus our efforts.
IS-IS does support padding, but a 30s outage is not acceptable in our network. Additionally, we have WAN circuits that run other routing protocols, such as OSPF and eBGP, which do not support hello padding. Even if the routing protocol does support padding, we may not want to use aggressive timers, as hello processing is a control plane activity. Also, as mentioned previously, even with the minimum 1s protocol timer we would still have about 3s of outage. As per common practice, we leave our protocol timers at their defaults and leverage BFD for fast failure detection. Hence, I believe BFD is a very good mechanism to address this issue.

I understand some customers want to run very aggressive BFD timers to detect failures quickly (at the expense of higher network churn). We found that we can achieve sub-second convergence with protection using a relatively conservative BFD interval of 150 ms. Also, as mentioned previously, depending on the implementation, BFD padding support may have a very small impact on performance.

I would also add that BFD padding support will be an option on a per-interface/neighbor basis. Network designers who do not have to deal with MTU issues can use the default behavior. They can also enable padding on WAN circuits and keep the default for back-to-back intra-site links.

Thanks
Albert

From: ginsb...@cisco.com
At: 10/23/18 19:52:53
To: Albert Fu (BLOOMBERG/ 120 PARK), rtg-bfd@ietf.org
Subject: RE: BFD WG adoption for draft-haas-bfd-large-packets

Albert -

From: Albert Fu (BLOOMBERG/ 120 PARK) <af...@bloomberg.net>
Sent: Tuesday, October 23, 2018 8:45 AM
To: rtg-bfd@ietf.org; Les Ginsberg (ginsberg) <ginsb...@cisco.com>
Subject: RE: BFD WG adoption for draft-haas-bfd-large-packets

Hi Les,

Given that it takes a relatively lengthy time to troubleshoot the MTU issue, and given the associated impact on customer traffic, it is important to have a reliable and fast mechanism to detect the issue.

[Les:] This is one of the points where we are not in full agreement.
I agree you need an easy and reliable way to detect the problem when it occurs. However, I disagree that you need to do this “fast” – when fast is defined as sub-second. You have something that we know only occurs during some maintenance event – which is planned and only occurs “once/day, week”. Checking for this even once/second is overly aggressive. If it came for free, then there would be no reason not to do so. But as this discussion has shown, there are costs/risks.

For example, if you were using IS-IS and you detected this within the default adjacency hold time (30 seconds on p2p circuits) – would that be too slow for you? If so, please explain why this is too slow.

I think the primary issue here is ease of use and reliability. Whether detection time is one second or one minute seems relatively unimportant. Do you disagree?

Les
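
For reference, the detection times compared in this thread all follow the same arithmetic: hello or BFD interval multiplied by a detection multiplier. The sketch below is only an illustration under stated assumptions. The function name `detection_time_ms` is hypothetical, and the multiplier of 3 is an assumption: it is consistent with the "1s timer, about 3s of outage" figure and with a 30 s IS-IS hold time if the hello interval is 10 s, but the thread does not state the BFD multiplier used with the 150 ms interval.

```python
def detection_time_ms(interval_ms: int, multiplier: int) -> int:
    """Approximate failure detection time: interval x detection multiplier.

    Hypothetical helper for illustration; integer milliseconds are used
    to avoid floating-point rounding in the comparison.
    """
    if interval_ms <= 0 or multiplier < 1:
        raise ValueError("interval must be positive and multiplier at least 1")
    return interval_ms * multiplier


# Assumption: a multiplier of 3 in every case (not stated in the thread for BFD).
scenarios = {
    "IS-IS default hold time on p2p (10 s hello assumed)": detection_time_ms(10_000, 3),
    "Routing protocol hello at the 1 s minimum": detection_time_ms(1_000, 3),
    "BFD at a 150 ms interval": detection_time_ms(150, 3),
}

for name, ms in scenarios.items():
    print(f"{name}: {ms} ms")
```

Under these assumptions, the gap between the options is roughly 30 s versus 3 s versus 450 ms, which is the trade-off being debated above.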