On 10/6/21 06:51, Hank Nussbacher wrote:
- "During one of these routine maintenance jobs, a command was issued
with the intention to assess the availability of global backbone
capacity, which unintentionally took down all the connections in our
backbone network"
Can anyone guess as to what command FB issued that would cause them to
withdraw all those prefixes?
Hard to say, as the command itself seems to have been innocent
enough, perhaps running a batch of sub-commands to check port status,
bandwidth utilization, MPLS-TE values, etc. However, it sounds like an
unforeseen bug in the command ran other things, or the cascade of how
the sub-commands were run caused unexpected problems.
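Purely as a hypothetical illustration (the device names, command
strings and the send_command() stand-in below are all invented, not
anything from the report), such an audit batch tends to look something
like this, and the dangerous part is usually the orchestration glue
rather than any single check:

    #!/usr/bin/env python3
    # Hypothetical sketch only -- not Facebook's tooling. Device names,
    # command strings and the send_command() stand-in are invented, to
    # show how a "read-only" capacity audit is really a batch of
    # sub-commands whose surrounding logic can itself carry the bug.

    ROUTERS = ["bb-router-1", "bb-router-2"]        # invented names

    AUDIT_COMMANDS = [
        "show interfaces description",              # port status
        "show interfaces statistics",               # bandwidth utilization
        "show mpls traffic-eng tunnels brief",      # MPLS-TE state
    ]

    def send_command(router: str, command: str) -> str:
        """Stand-in for whatever transport the real tool uses."""
        print(f"[{router}] {command}")
        return "ok"

    def audit_backbone() -> None:
        for router in ROUTERS:
            for command in AUDIT_COMMANDS:
                send_command(router, command)
                # The risk is rarely in any one show command; it is in
                # the glue around it -- e.g. an error handler that
                # "drains" a device it wrongly believes is unhealthy,
                # fired against every device at once.

    if __name__ == "__main__":
        audit_backbone()
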
We shall guess this one forever, as I doubt Facebook will go into that
much detail.
What I can tell you is that all the major content providers spend a lot
of time, money and effort automating both capacity planning and
capacity auditing. It's a bit more complex for them, because their
variables aren't just links and utilization, but also locations, fibre
availability, fibre pricing, capacity lease pricing, the presence of
carrier-neutral data centres, the presence of exchange points, current
vendor equipment models and pricing, projections of future fibre and
capacity pricing, etc.
It's a totally different world from normal ISP-land.
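To make that concrete, the input to one of those planning runs looks
more like the sketch below than a simple list of links and their
utilization. The fields and figures are invented for illustration, not
anyone's actual model:

    from dataclasses import dataclass

    # Minimal sketch with invented fields and figures -- only meant to
    # show how many variables beyond "link + utilization" feed the model.

    @dataclass
    class CandidateLink:
        a_end: str                        # location / metro
        b_end: str
        utilization_pct: float            # traffic on existing capacity
        fibre_available: bool             # is dark fibre even an option?
        fibre_price_per_month: float      # fibre pricing
        lease_price_per_month: float      # capacity lease pricing
        carrier_neutral_dc: bool          # carrier-neutral data centre
        exchange_point: bool              # exchange point present
        vendor_platform: str              # current equipment model
        platform_cost: float              # vendor pricing
        price_trend_pct: float            # projected fibre/capacity pricing

    links = [
        CandidateLink("JNB", "CPT", 71.0, True, 9_000.0, 14_000.0,
                      True, True, "example-400G-platform", 120_000.0, -8.0),
    ]

    # A planning run weighs all of it, not just utilization:
    for link in links:
        build = link.fibre_available and \
            link.fibre_price_per_month < link.lease_price_per_month
        print(link.a_end, "-", link.b_end,
              "build on dark fibre" if build else "lease capacity")
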
- "it was not possible to access our data centers through our normal
means because their networks were down, and second, the total loss of
DNS broke many of the internal tools we’d normally use to investigate
and resolve outages like this. Our primary and out-of-band network
access was down..."
Does this mean that FB acknowledges that the loss of DNS broke their
OOB access?
I need to put my thinking cap on, but I'm not sure whether running DNS
in the IGP would have been better in this instance.
We run our Anycast DNS network in our IGP, mainly to always guarantee
latency-based routing, but also to ensure that the failure of a
higher-level protocol like BGP does not disconnect internal access that
is needed for troubleshooting and repair. Given that the IGP is a much
lower-level routing protocol, it's more likely (though not guaranteed)
that it would not go down with BGP.
In the past, we have indeed had BGP issues during which we could still
maintain internal DNS access because the IGP was unaffected.
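For the curious, the usual shape of that design is sketched below: the
anycast service address lives on a loopback that the IGP carries as a
connected route, and a local health check withdraws it when the
resolver itself is sick. This is a general illustration of the
pattern, not our actual tooling; the address and probe name are
examples:

    #!/usr/bin/env python3
    # General sketch of the anycast-DNS-in-the-IGP pattern, not our
    # actual tooling. 192.0.2.53 (documentation space) stands in for
    # the anycast resolver address. The assumption is that the attached
    # router redistributes the loopback /32 into the IGP as a connected
    # route, so removing the address withdraws this node from the
    # anycast set -- no BGP involved anywhere.

    import subprocess

    ANYCAST_ADDR = "192.0.2.53/32"    # example address
    PROBE_NAME = "example.com"        # a name the resolver must answer

    def resolver_healthy() -> bool:
        """Query the local resolver directly over loopback."""
        result = subprocess.run(
            ["dig", "@127.0.0.1", PROBE_NAME, "+short",
             "+time=2", "+tries=1"],
            capture_output=True, text=True)
        return result.returncode == 0 and result.stdout.strip() != ""

    def set_anycast(present: bool) -> None:
        """Add or remove the anycast /32 on loopback."""
        action = "add" if present else "del"
        subprocess.run(["ip", "addr", action, ANYCAST_ADDR, "dev", "lo"],
                       check=False)

    if __name__ == "__main__":
        set_anycast(resolver_healthy())

The point is that the reachability of the resolver hangs off a
connected route carried in the IGP, so a BGP failure higher up doesn't
take the resolvers with it.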
The final statement from that report is interesting:
"From here on out, our job is to strengthen our testing,
drills, and overall resilience to make sure events like this
happen as rarely as possible."
... which, in my rudimentary translation, means that:
"There are no guarantees that our automation software will not
poop cows again, but we hope that when that does happen, we
shall be able to send our guys out to site much more quickly."
... which, to be fair, is totally understandable. These automation
tools, especially in large networks such as BigContent, are
significantly more fragile the more complex they get, and the more batch
tasks they need to perform on various parts of a network of this size
and scope. It's a pity these automation tools are all homegrown, and
can't be bought "pre-packaged and pre-approved to never fail" from IT
Software Store down the road. But it's the only way for networks of
this capacity to operate, and a risk they will always sit with for
being that large.
Mark.