* David Zimmerman
Hi, all. BFD is well known for what it brings to the table in
improving link failure detection; however, even at a reasonably
athletic 300ms Control rate, you're not going to catch a significant
percentage of brownout situations where you have packet loss but not a
full outage. I'm trying to:
1. find any formal or semi-formal writing that quantifies BFD's
effectiveness. My mental picture is a 3D graph where the X axis is
the percentage of packet loss, the Y axis is the Control/Detection
timer tuple, and the Z axis is the likelihood that BFD will fully
engage (i.e., miss all three Control packets in a row). Beyond what
I believe is a visualization complexity needing some single malt
scotch nearby, the real snag is that letting even a single Control
packet through resets your Detection timer (see the rough sketch
after this list).
2. ask if folks in the Real World use BFD towards this end, or have
other mechanisms as a data plane loss instrumentation vehicle.
For example, in my wanderings, I've found an environment that
offloads the diagnostic load to adjacent compute nodes, but they
reach out to orchestration to trigger further router actions in a
full-circle cycle measured in /minutes/. Short of that, really
aggressive timers (solving through brute force) on BFD quickly hit
platform limits for scale unless perhaps you can offboard the BFD
to something inline (e.g. the Ciena 5170 can be dialed down to a
3.3ms Control timer).
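For concreteness on #1, here's the sort of back-of-the-envelope model
I have in mind, as a Python sketch. It assumes independent, non-bursty
per-packet loss, which real brownouts rarely exhibit, and the
1% / 300ms / x3 numbers are purely illustrative:

    # Probability that BFD fully engages (detect_mult consecutive Control
    # packets lost) at least once during an observation window, assuming
    # independent per-packet loss -- a big simplification, since real
    # brownout loss is usually bursty.
    def p_bfd_down(loss, detect_mult, interval_s, window_s):
        n = round(window_s / interval_s)  # Control packets sent in the window
        k = detect_mult
        # q[i] = P(no run of k consecutive losses among the first i packets)
        q = [1.0] * (n + 1)
        for i in range(k, n + 1):
            if i == k:
                q[i] = 1.0 - loss ** k
            else:
                # First run of k losses ends exactly at packet i: packet
                # i-k gets through, then k losses, with no earlier run.
                q[i] = q[i - 1] - (1.0 - loss) * loss ** k * q[i - k - 1]
        return 1.0 - q[n]

    # 1% random loss, 300ms Control interval, multiplier of 3:
    print(p_bfd_down(0.01, 3, 0.300, 60))    # ~2e-4 over one minute
    print(p_bfd_down(0.01, 3, 0.300, 3600))  # ~1.2e-2 over one hour

Even at 1% loss, that's roughly a 1-in-5000 chance per minute of BFD
declaring the session down, which is why I don't think timers alone
solve the brownout case.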
Any thoughts appreciated. I'm also pursuing ways of having my
internal "customer" signal me when they observe packet loss on their
end (e.g. 1% loss for most folks is a TCP retransmission, but 1% loss
for them is crying eyeballs and an escalation).
Hi David.
We're simply monitoring the error counters on our interfaces, as
brownout packet loss due to an unhealthy link usually appears as
receive errors starting to tick up. If the error rate exceeds a
certain percentage of the total pps on the link, we automatically
apply the BGP Graceful Shutdown community to the BGP sessions running
on that link, so that it is automatically drained of production
traffic (assuming that healthier paths remain that the traffic can
move to).
This obviously works best if you control and monitor both ends of the
link, and ensure that you always have enough bandwidth on the redundant
path to handle the full load.
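The decision logic is roughly the following; this is a simplified
sketch, the 0.5% threshold and the example numbers are illustrative
placeholders rather than our actual values, and the counter polling
and community tagging are platform-specific and left out:

    # Drain decision only: counters are assumed to be polled as deltas per
    # interval; actually attaching the community to the BGP sessions is
    # platform-specific. The 0.5% threshold is an illustrative placeholder.
    GRACEFUL_SHUTDOWN = "65535:0"   # well-known community from RFC 8326

    def should_drain(rx_error_delta, rx_packet_delta, threshold=0.005):
        """True if the errored share of received packets in the last
        polling interval exceeds the threshold."""
        if rx_packet_delta <= 0:
            return False            # idle link or counter wrap: do nothing
        return rx_error_delta / rx_packet_delta > threshold

    # Example: 1,200 receive errors against 180,000 received packets
    # (~0.67%) in one interval would get the link's BGP sessions tagged.
    print(should_drain(1200, 180_000))   # True

Neighbors that honor GRACEFUL_SHUTDOWN then drop LOCAL_PREF on the
routes learned over that link, so traffic shifts to the healthier
paths without the sessions themselves being torn down.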
Tore