On 10/6/21 06:51, Hank Nussbacher wrote:
- "During one of these routine maintenance jobs, a command was issued
with the intention to assess the availability of global backbone
capacity, which unintentionally took down all the connections in our
backbone network"
Can anyone guess as to what command FB issued that would cause them to
withdraw all those prefixes?
Hard to say, as the command itself seems to have been innocent
enough, perhaps running a batch of sub-commands to check port status,
bandwidth utilization, MPLS-TE values, etc. However, it sounds like an
unforeseen bug in the command ran other things, or the cascade of how
the sub-commands were run caused unexpected problems.
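Purely as a hypothetical illustration (the device names, command
strings and the send_command() stand-in below are all invented, not
anything from the report), such an audit batch tends to look something
like this, and the dangerous part is usually the orchestration glue
rather than any single check:

    #!/usr/bin/env python3
    # Hypothetical sketch only -- not Facebook's tooling. Device names,
    # command strings and the send_command() stand-in are invented, to
    # show how a "read-only" capacity audit is really a batch of
    # sub-commands whose surrounding logic can itself carry the bug.

    ROUTERS = ["bb-router-1", "bb-router-2"]        # invented names

    AUDIT_COMMANDS = [
        "show interfaces description",              # port status
        "show interfaces statistics",               # bandwidth utilization
        "show mpls traffic-eng tunnels brief",      # MPLS-TE state
    ]

    def send_command(router: str, command: str) -> str:
        """Stand-in for whatever transport the real tool uses."""
        print(f"[{router}] {command}")
        return "ok"

    def audit_backbone() -> None:
        for router in ROUTERS:
            for command in AUDIT_COMMANDS:
                send_command(router, command)
                # The risk is rarely in any one show command; it is in
                # the glue around it -- e.g. an error handler that
                # "drains" a device it wrongly believes is unhealthy,
                # fired against every device at once.

    if __name__ == "__main__":
        audit_backbone()
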
We shall guess this one forever, as I doubt Facebook will go into that
much detail.
What I can tell you is that all the major content providers spend a lot
of time, money and effort automating both capacity planning and
capacity auditing. It's a bit more complex for them, because their
variables aren't just links and utilization, but also locations, fibre
availability, fibre pricing, capacity lease pricing, the presence of
carrier-neutral data centres, the presence of exchange points, current
vendor equipment models and pricing, projections of future fibre and
capacity pricing, etc.
It's a totally different world from normal ISP-land.
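To make that concrete, the input to one of those planning runs looks
more like the sketch below than a simple list of links and their
utilization. The fields and figures are invented for illustration, not
anyone's actual model:

    from dataclasses import dataclass

    # Minimal sketch with invented fields and figures -- only meant to
    # show how many variables beyond "link + utilization" feed the model.

    @dataclass
    class CandidateLink:
        a_end: str                        # location / metro
        b_end: str
        utilization_pct: float            # traffic on existing capacity
        fibre_available: bool             # is dark fibre even an option?
        fibre_price_per_month: float      # fibre pricing
        lease_price_per_month: float      # capacity lease pricing
        carrier_neutral_dc: bool          # carrier-neutral data centre
        exchange_point: bool              # exchange point present
        vendor_platform: str              # current equipment model
        platform_cost: float              # vendor pricing
        price_trend_pct: float            # projected fibre/capacity pricing

    links = [
        CandidateLink("JNB", "CPT", 71.0, True, 9_000.0, 14_000.0,
                      True, True, "example-400G-platform", 120_000.0, -8.0),
    ]

    # A planning run weighs all of it, not just utilization:
    for link in links:
        build = link.fibre_available and \
            link.fibre_price_per_month < link.lease_price_per_month
        print(link.a_end, "-", link.b_end,
              "build on dark fibre" if build else "lease capacity")
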
- "it was not possible to access our data centers through our normal
means because their networks were down, and second, the total loss of
DNS broke many of the internal tools we’d normally use to investigate
and resolve outages like this. Our primary and out-of-band network
access was down..."
Does this mean that FB acknowledges that the loss of DNS broke their
OOB access?
I need to put my thinking cap on, but I'm not sure whether running DNS
in the IGP would have been better in this instance.
We run our Anycast DNS network in our IGP, mainly to always guarantee
latency-based routing, but also to ensure that the failure of a
higher-level protocol like BGP does not disconnect internal access that
is needed for troubleshooting and repair. Given that the IGP is a much
lower-level routing protocol, it's more likely (though not guaranteed)
that it would not go down with BGP.
In the past, we have indeed had BGP issues during which we could still
maintain internal DNS access because the IGP was unaffected.
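For the curious, the usual shape of that design is sketched below: the
anycast service address lives on a loopback that the IGP carries as a
connected route, and a local health check withdraws it when the
resolver itself is sick. This is a general illustration of the
pattern, not our actual tooling; the address and probe name are
examples:

    #!/usr/bin/env python3
    # General sketch of the anycast-DNS-in-the-IGP pattern, not our
    # actual tooling. 192.0.2.53 (documentation space) stands in for
    # the anycast resolver address. The assumption is that the attached
    # router redistributes the loopback /32 into the IGP as a connected
    # route, so removing the address withdraws this node from the
    # anycast set -- no BGP involved anywhere.

    import subprocess

    ANYCAST_ADDR = "192.0.2.53/32"    # example address
    PROBE_NAME = "example.com"        # a name the resolver must answer

    def resolver_healthy() -> bool:
        """Query the local resolver directly over loopback."""
        result = subprocess.run(
            ["dig", "@127.0.0.1", PROBE_NAME, "+short",
             "+time=2", "+tries=1"],
            capture_output=True, text=True)
        return result.returncode == 0 and result.stdout.strip() != ""

    def set_anycast(present: bool) -> None:
        """Add or remove the anycast /32 on loopback."""
        action = "add" if present else "del"
        subprocess.run(["ip", "addr", action, ANYCAST_ADDR, "dev", "lo"],
                       check=False)

    if __name__ == "__main__":
        set_anycast(resolver_healthy())

The point is that the reachability of the resolver hangs off a
connected route carried in the IGP, so a BGP failure higher up doesn't
take the resolvers with it.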
The final statement from that report is interesting:
"From here on out, our job is to strengthen our testing,
drills, and overall resilience to make sure events like this
happen as rarely as possible."
... which, in my rudimentary translation, means that:
"There are no guarantees that our automation software will not
poop cows again, but we hope that when that does happen, we
shall be able to send our guys out to site much more quickly."
... which, to be fair, is totally understandable. These automation
tools, especially in large networks such as BigContent, are
significantly more fragile the more complex they get, and the more batch
tasks they need to perform on various parts of a network of this size
and scope. It's a pity these automation tools are all homegrown, and
can't be bought "pre-packaged and pre-approved to never fail" from IT
Software Store down the road. But it's the only way for networks of
this capacity to operate, and a risk they will always sit with for
being that large.
Mark.