Hi, all.  BFD is well known for what it brings to the table in improving link 
failure detection; however, even at a reasonably athletic 300ms Control rate, 
you're not going to catch a significant fraction of brownout situations where 
you have packet loss but not a full outage.  I'm trying to:


  1.  find any formal or semi-formal writing that quantifies BFD's 
effectiveness.  For example, my mental picture is a 3D graph where the X axis 
is percentage of packet loss, the Y axis is the Control/Detection timer tuple, 
and the Z axis is the likelihood that BFD will fully engage (i.e., all three 
Control packets within a Detection Time being lost); a rough sketch of that Z 
axis under simplifying assumptions follows this list.  Beyond the visualization 
complexity, which I suspect calls for some single malt scotch nearby, the 
wrinkle is that letting even a single Control packet through resets your 
Detection timer.
  2.  ask if folks in the Real World use BFD toward this end, or have other 
mechanisms as a data-plane loss instrumentation vehicle.  For example, in my 
wanderings I've found an environment that offloads the diagnostic load to 
adjacent compute nodes, but those nodes reach out to orchestration to trigger 
further router actions, a full-circle cycle measured in minutes.  Short of 
that, really aggressive BFD timers (solving through brute force) quickly hit 
platform scale limits unless perhaps you can offboard the BFD to something 
inline (e.g., the Ciena 5170 can be dialed down to a 3.3ms Control timer).
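
To make the Z axis in (1) concrete, here's the rough back-of-the-envelope 
sketch mentioned above (Python; the 300ms interval and x3 multiplier come from 
the numbers above, while the 60-second observation period is just an 
illustrative choice of mine).  It treats each Control packet as independently 
lost, which is optimistic since real brownouts are bursty, and it only counts 
aligned, non-overlapping Detection windows, so it slightly undercounts loss 
runs that straddle window boundaries:

  # Rough model: BFD only engages if every Control packet inside one
  # Detection Time window (detect_mult consecutive packets) is lost.
  # Assumes independent per-packet loss and non-overlapping windows.

  def p_engage_per_window(loss, detect_mult=3):
      return loss ** detect_mult

  def p_engage_over_period(loss, interval_ms=300, detect_mult=3, period_s=60):
      windows = (period_s * 1000) / (interval_ms * detect_mult)
      return 1 - (1 - p_engage_per_window(loss, detect_mult)) ** windows

  for loss in (0.01, 0.05, 0.20, 0.50):
      print(f"{loss:.0%} loss: per-window {p_engage_per_window(loss):.2e}, "
            f"over 60s {p_engage_over_period(loss):.1%}")

Under those assumptions, 1% loss at 300ms/x3 essentially never trips BFD even 
over a full minute, 20% loss trips it less than half the time, and only severe 
loss gets you reliable detection, which matches the intuition that a single 
Control packet getting through resets the clock.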

Any thoughts appreciated.  I'm also pursuing ways of having my internal 
"customer" signal me upon their own packet-loss observations (e.g., 1% loss for 
most folks is a TCP retransmission, but 1% loss for them is crying eyeballs and 
an escalation).

-dp
