* David Zimmerman
Hi, all. BFD is well known for what it brings to the table in
improving link failure detection; however, even at a reasonably
athletic 300ms Control rate, you're not going to catch a significant
percentage of brownout situations where you have packet loss but not a
full outage. I'm trying to:
1. find any formal or semi-formal writing that quantifies BFD's
effectiveness. My mental picture is a 3D graph where the X axis is
the percentage of packet loss, the Y axis is the Control/Detection
timer tuple, and the Z axis is the likelihood that BFD will fully
engage (i.e., miss all three Control packets in a row). Beyond what
I believe is a visualization complexity needing some single malt
scotch nearby, the real snag is that letting even a single Control
packet through resets your Detection timer (see the rough sketch
after this list).
2. ask if folks in the Real World use BFD towards this end, or have
other mechanisms as a data plane loss instrumentation vehicle.
For example, in my wanderings, I've found an environment that
offloads the diagnostic load to adjacent compute nodes, but they
reach out to orchestration to trigger further router actions in a
full-circle cycle measured in /minutes/. Short of that, really
aggressive timers (solving through brute force) on BFD quickly hit
platform limits for scale unless perhaps you can offboard the BFD
to something inline (e.g. the Ciena 5170 can be dialed down to a
3.3ms Control timer).
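For concreteness on #1, here's the sort of back-of-the-envelope model
I have in mind, as a Python sketch. It assumes independent, non-bursty
per-packet loss, which real brownouts rarely exhibit, and the
1% / 300ms / x3 numbers are purely illustrative:

    # Probability that BFD fully engages (detect_mult consecutive Control
    # packets lost) at least once during an observation window, assuming
    # independent per-packet loss -- a big simplification, since real
    # brownout loss is usually bursty.
    def p_bfd_down(loss, detect_mult, interval_s, window_s):
        n = round(window_s / interval_s)  # Control packets sent in the window
        k = detect_mult
        # q[i] = P(no run of k consecutive losses among the first i packets)
        q = [1.0] * (n + 1)
        for i in range(k, n + 1):
            if i == k:
                q[i] = 1.0 - loss ** k
            else:
                # First run of k losses ends exactly at packet i: packet
                # i-k gets through, then k losses, with no earlier run.
                q[i] = q[i - 1] - (1.0 - loss) * loss ** k * q[i - k - 1]
        return 1.0 - q[n]

    # 1% random loss, 300ms Control interval, multiplier of 3:
    print(p_bfd_down(0.01, 3, 0.300, 60))    # ~2e-4 over one minute
    print(p_bfd_down(0.01, 3, 0.300, 3600))  # ~1.2e-2 over one hour

Even at 1% loss, that's roughly a 1-in-5000 chance per minute of BFD
declaring the session down, which is why I don't think timers alone
solve the brownout case.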
Any thoughts appreciated. I'm also pursuing ways of having my
internal "customer" signal me when they observe packet loss on their
end (e.g. 1% loss for most folks is a TCP retransmission, but 1% loss
for them is crying eyeballs and an escalation).
Hi David.
We're simply monitoring the error counters on our interfaces, as
brownout packet loss due to an unhealthy link usually appears as
receive errors starting to tick up. If the error rate exceeds a
certain percentage of the total pps on the link, we automatically
apply the BGP Graceful Shutdown community to the BGP sessions running
on that link, so that it is automatically drained of production
traffic (assuming that healthier paths remain that the traffic can
move to).
This obviously works best if you control and monitor both ends of the
link, and ensure that you always have enough bandwidth on the redundant
path to handle the full load.
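The decision logic is roughly the following; this is a simplified
sketch, the 0.5% threshold and the example numbers are illustrative
placeholders rather than our actual values, and the counter polling
and community tagging are platform-specific and left out:

    # Drain decision only: counters are assumed to be polled as deltas per
    # interval; actually attaching the community to the BGP sessions is
    # platform-specific. The 0.5% threshold is an illustrative placeholder.
    GRACEFUL_SHUTDOWN = "65535:0"   # well-known community from RFC 8326

    def should_drain(rx_error_delta, rx_packet_delta, threshold=0.005):
        """True if the errored share of received packets in the last
        polling interval exceeds the threshold."""
        if rx_packet_delta <= 0:
            return False            # idle link or counter wrap: do nothing
        return rx_error_delta / rx_packet_delta > threshold

    # Example: 1,200 receive errors against 180,000 received packets
    # (~0.67%) in one interval would get the link's BGP sessions tagged.
    print(should_drain(1200, 180_000))   # True

Neighbors that honor GRACEFUL_SHUTDOWN then drop LOCAL_PREF on the
routes learned over that link, so traffic shifts to the healthier
paths without the sessions themselves being torn down.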
Tore