Hi Acee,

Please see comments in-line. 

Thanks,

Albert

From: a...@cisco.com At: 10/23/18 13:02:49
To: Albert Fu (BLOOMBERG/ 120 PARK ), rtg-bfd@ietf.org, ginsb...@cisco.com
Subject: Re: BFD WG adoption for draft-haas-bfd-large-packets

      

Hi Albert,  
  

From: "Albert Fu (BLOOMBERG/ 120 PARK)" <af...@bloomberg.net>
Reply-To: Albert Fu <af...@bloomberg.net>
Date: Tuesday, October 23, 2018 at 12:45 PM
To: "rtg-bfd@ietf.org" <rtg-bfd@ietf.org>, "Les Ginsberg (ginsberg)" 
<ginsb...@cisco.com>, Acee Lindem <a...@cisco.com>
Subject: Re: BFD WG adoption for draft-haas-bfd-large-packets 

  

Hi Acee,  

  

You are right in that this issue does not happen frequently, but when it does, it is time-consuming to troubleshoot and causes unnecessary network downtime for some applications (e.g. between two end hosts, some applications worked fine, while others would intermittently fail when they tried to send large packets over the failing ECMP path).
  
So you’re saying there is a problem where the data plane interfaces do not 
support the configured MTU due to a SW bug? I hope these are not our routers 😉


AF> There's no bug.  


1) The issue we have seen is with the Telco network. The router can happily 
transmit and receive up to the configured interface MTU, but the Telco circuit 
fails to support it. One example is when the Telco uses an L2VPN to deliver the 
P2P service to us, but due to some fault, traffic was re-routed to a 
misconfigured path that did not support our MTU size (e.g. the MTU on the Telco 
PE router was not increased to account for the MPLS headers added for the L2VPN service).
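
To illustrate the arithmetic behind that failure mode, a minimal sketch (the two-label stack and the helper name are assumptions for illustration; real L2VPN overhead may also include a control word and an inner Ethernet header):

    # Rough illustration (assumed values) of why a Telco core MTU that
    # ignores MPLS overhead breaks a customer circuit.
    MPLS_LABEL_BYTES = 4

    def required_core_mtu(customer_payload: int, label_count: int) -> int:
        """MTU the provider path must support: customer payload plus the
        MPLS label stack (other L2VPN overhead is ignored here)."""
        return customer_payload + label_count * MPLS_LABEL_BYTES

    # A 1500-byte customer packet over an L2VPN with a two-label stack
    # (transport + pseudowire) needs at least 1508 bytes of core MTU:
    print(required_core_mtu(1500, 2))  # 1508
    # If the re-routed path only supports 1500, any customer packet larger
    # than 1492 bytes is silently lost.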


2) AFAIK, OSPF MTU detection is based on checking the MTU value in the DBD 
packet. The actual OSPF packets may be smaller, so a data plane issue in the 
Telco network may not be detected during OSPF session establishment.
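
For reference, a minimal sketch of the control-plane check being described (based on the Database Description MTU check in RFC 2328; the function name is an assumption):

    def accept_dbd(dbd_interface_mtu: int, local_interface_mtu: int) -> bool:
        """Simplified OSPF Database Description MTU check (RFC 2328,
        Section 10.6): reject the DBD if the neighbor advertises an MTU
        larger than the receiving interface can accept."""
        return dbd_interface_mtu <= local_interface_mtu

    # The check compares advertised/configured values only; the DBD packet
    # itself is usually far smaller than the MTU, so a Telco path that
    # silently drops large frames can still pass it.
    print(accept_dbd(1500, 1500))  # True even if the circuit cannot
                                   # actually carry 1500-byte frames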


 

  

I believe the OSPF MTU detection is a control plane mechanism to check configuration, 
and may not necessarily detect a data plane MTU issue (since OSPF does not 
support padding). Also, most of our issues occurred after the routing adjacency 
had been established, and without any network alarms. 
  
Right. However, if the interface is flapped when the MTU changes, OSPF would 
detect dynamic MTU changes (e.g., configuration) that the control plane is 
aware of. 


AF> We have encountered the MTU issue without any interface flaps on our 
routers (and no config change on our routers). The MTU issue occurred within the 
Telco network. Note also that some Telco providers offering WAN circuits spanning 
several countries may use smaller local providers for the last mile. We have 
seen issues with these smaller providers.


 
  
Thanks, 
Acee  

  

Thanks 

Albert 

  

From: a...@cisco.com At: 10/23/18 12:30:55 
To: Albert Fu (BLOOMBERG/ 120 PARK ) , rtg-bfd@ietf.org, ginsb...@cisco.com
Subject: Re: BFD WG adoption for draft-haas-bfd-large-packets 


Hi Albert, Les,  
  
I tend to agree with Les that BFD doesn’t seem like the right protocol for 
this. Note that if you use OSPF as your IGP and flap the interface when the MTU 
changes, you’ll detect MTU mismatches immediately due to OSPF’s DB exchange 
MTU negotiation. Granted, control plane detection won’t detect data plane bugs 
resulting in MTU fluctuations, but I don’t see this as a frequent event. 
  
Thanks, 
Acee 
  
  

From: Rtg-bfd <rtg-bfd-boun...@ietf.org> on behalf of "Albert Fu (BLOOMBERG/ 
120 PARK)" <af...@bloomberg.net>
Reply-To: Albert Fu <af...@bloomberg.net>
Date: Tuesday, October 23, 2018 at 11:44 AM
To: "rtg-bfd@ietf.org" <rtg-bfd@ietf.org>, "Les Ginsberg (ginsberg)" 
<ginsb...@cisco.com>
Subject: RE: BFD WG adoption for draft-haas-bfd-large-packets 

  

Hi Les,   

  

Given the relatively lengthy time it takes to troubleshoot the MTU issue, and 
the associated impact on customer traffic, it is important to have a reliable 
and fast mechanism to detect the issue.  

  

I believe BFD, especially in the single-hop, control-plane-independent case 
(btw, this covers the majority of our BFD use cases), is indeed an ideal and 
reliable solution for this purpose. It is also closely tied to the routing 
protocols, and enables traffic to be diverted very quickly.  

  

The choice of BFD timers is also one of the design tradeoffs - a low BFD detection 
timer will cause more network churn. We do not need extremely aggressive BFD 
timers to achieve fast convergence. For example, with protection, we can 
achieve end-to-end sub-second convergence using a relatively high BFD interval 
of 150ms.  
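
As a quick sanity check of those numbers, a sketch (the detect multiplier of 3 is an assumption; only the 150ms interval is given above):

    def bfd_detection_time_ms(tx_interval_ms: int, detect_multiplier: int) -> int:
        """BFD detection time = negotiated transmit interval * detect multiplier."""
        return tx_interval_ms * detect_multiplier

    # A 150ms interval with a multiplier of 3 gives 450ms to declare the
    # session down, leaving headroom for protection switchover while
    # staying within sub-second convergence end to end.
    print(bfd_detection_time_ms(150, 3))  # 450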

  

In the case where the path will be used for a variety of encapsulations (e.g. 
pure IP and L3VPN traffic), we would set the BFD padding to cater for the 
largest possible payload. So, in our case, where the link needs to carry a mix of 
pure IP (1500-byte max payload) and MPLS traffic (1500 bytes + 3 MPLS headers), we 
would set the padding so that the total padded BFD packet size is 1512 bytes.  
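
The 1512-byte figure follows directly from the label overhead (a sketch; 4 bytes per MPLS label header):

    MPLS_LABEL_BYTES = 4

    def padded_bfd_packet_size(max_ip_payload: int, label_count: int) -> int:
        """Pad BFD packets to cover the largest encapsulation on the link:
        the maximum IP payload plus the deepest expected MPLS label stack."""
        return max_ip_payload + label_count * MPLS_LABEL_BYTES

    # Mixed pure-IP (1500 bytes) and MPLS traffic with up to 3 labels:
    print(padded_bfd_packet_size(1500, 3))  # 1512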

  

As you rightly pointed out, the IS-IS routing protocol does support hello padding, 
but since this is a control plane process, we cannot use aggressive timers. The 
lowest hello interval that can be configured is 1s, so with the default multiplier 
of 3, the best we can achieve is a 3s detection time. 

  

What we would like is a simple mechanism to validate that a link can indeed 
carry the expected max payload size before we put it into production. If an 
issue occurs where this is no longer the case (e.g. due to outages or 
re-routing within the Telco circuit), we would like a reliable mechanism to 
detect this, and also divert traffic around the link quickly. I feel BFD is a 
good method for this purpose. 
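
As an aside, a rough sketch of the kind of one-off pre-production check being described, using a standard ping with fragmentation disabled (assumes a Linux host with iputils ping and an IPv4 path; the 28 bytes are the IPv4 and ICMP headers; the function name and target address are illustrative):

    import subprocess

    def link_carries_packet(target_ip: str, total_packet_bytes: int) -> bool:
        """Return True if the path to target_ip carries an IPv4 packet of the
        given total size without fragmentation (Linux iputils ping)."""
        icmp_payload = total_packet_bytes - 28  # 20-byte IP + 8-byte ICMP header
        result = subprocess.run(
            ["ping", "-c", "3", "-M", "do", "-s", str(icmp_payload), target_ip],
            capture_output=True,
        )
        return result.returncode == 0

    # e.g. verify the link carries full-size packets before turn-up:
    # link_carries_packet("192.0.2.1", 1500)

A check like this is manual and one-shot; it does not provide the continuous, fast detection (and traffic diversion via the routing protocols) that padded BFD would give once the link is in production.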

  

Thanks 

Albert   

  

From: ginsb...@cisco.com At: 10/23/18 10:45:02 
To: Albert Fu (BLOOMBERG/ 120 PARK ) , rtg-bfd@ietf.org
Subject: RE: BFD WG adoption for draft-haas-bfd-large-packets 


Albert – 
  
Please understand that I fully agree with the importance of being able to 
detect/report MTU issues. In my own experience this can be a difficult problem 
to diagnose. You do not have to convince me that some improvement in 
detection/reporting is needed. The question really is whether using BFD is the 
best option. 
  
Could you respond to my original questions – particularly why sub-second 
detection of this issue is a requirement? 
  
For your convenience: 
  
<snip> 
It has been stated that there is a need for sub-second detection of this 
condition – but I really question that requirement.  
What I would expect is that MTU changes only occur as a result of some 
maintenance operation (configuration change, link addition/bringup, insertion 
of a new box in the physical path, etc.). The idea of using a mechanism which 
is specifically tailored for sub-second detection to monitor something that is 
only going to change occasionally seems inappropriate. It makes me think that 
other mechanisms (some form of OAM, enhancements to routing protocols to do 
what IS-IS already does :-)) could be more appropriate and would still meet the 
operational requirements. 
  
I have listened to the Montreal recording – and I know there was discussion 
related to these issues (not sending padded packets all the time, use of BFD 
echo, etc.) – but I would be interested in more discussion of the need for 
sub-second detection. 
  
Also, given that a path might be used with a variety of encapsulations, how do 
you see such a mechanism being used when multiple BFD clients share the same 
BFD session and their MTU constraints are different? 
<end snip> 
  
Thanx. 
  
   Les 
  
 

 
   
