I’m “guessing,” based on all the services that were impacted, that the outage was likely caused by a change that triggered a routing shift in their multi-service network, overloading many network devices; once the source of the bad routes or traffic was isolated, the rest of the network was able to recover.
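As a rough sketch of that dynamic (purely a toy model; the capacities and churn numbers below are invented, not anything from the actual outage):

# Toy model (all numbers invented): every device in a shared routing
# domain sees the same update churn, each has a fixed processing
# capacity, and isolating the churn source lets the rest recover.

CAPACITY = 1_000      # updates/sec a device can absorb (assumed)
BASELINE = 200        # normal background churn (assumed)
EVENT_CHURN = 5_000   # extra churn from the bad change (assumed)

def device_state(churn):
    return "overloaded" if churn > CAPACITY else "healthy"

devices = ["core1", "core2", "core3", "core4"]

# During the event, the routing change floods every device at once.
print({d: device_state(BASELINE + EVENT_CHURN) for d in devices})
# -> all "overloaded"

# After the source of the routes/traffic is isolated, churn drops
# back to baseline and the surviving devices recover on their own.
print({d: device_state(BASELINE) for d in devices})
# -> all "healthy"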
But just a guess.

Shane

> On Jul 11, 2022, at 4:22 PM, Matthew Petach <mpet...@netflight.com> wrote:
>
>> On Mon, Jul 11, 2022 at 9:01 AM Andrey Kostin <ank...@podolsk.ru> wrote:
>> It's hard to believe that a simultaneous maintenance affecting so many
>> devices in the core network could be approved. Core networks are built
>> with redundancy, so that failures can't completely destroy the whole
>> network.
>
> I think you might need to re-evaluate your assumption
> about how core networks are built.
>
> A well-designed core network will have layers of redundancy
> built in, with easy isolation of fault layers, yes.
>
> I've seen (and sometimes worked on) too many networks
> that didn't have enough budget for redundancy and were
> built as a string of pearls, one router to the next; if any router
> in the string broke, the entire string of pearls would
> come crashing down, to abuse a metaphor just a bit too much.
>
> Really well-thought-out redundancy takes a design team that
> has enough experience and enough focused hours in the day
> to think through different failure modes and lay out the design
> ahead of time, before purchases get made. Many real-world
> networks share the same engineers between design, deployment,
> and operation of the network; in that model, operation and
> deployment always win over design when it comes time to allocate
> engineering hours. Likewise, if you didn't have the luxury of laying
> out the design ahead of time, before purchasing hardware and
> leasing facilities, you're likely doing the best you can with locations
> that were contracted before you came into the picture, using hardware
> that was decided on before you had a chance to suggest better
> alternatives.
>
> Taking it a step further, and thinking about the large Facebook outage:
> even if you did well in the design phase and chose two different vendors,
> with hardware redundancy and site redundancy across your entire core
> network, did you also think about redundancy and diversity on the
> O&M side of the house? Does each redundant data plane have a
> diverse control plane and management plane, or would an errant
> redistribution of BGP into your IGP wipe out both data planes, and both
> hardware vendors, at the same time? Likewise, if a bad configuration
> push isolates your core network nodes from the "God box" that
> controls the device configurations, do you have redundant
> connectivity to that "God box" so that you can restore known-good
> configurations to your core network sites, or are you stuck dispatching
> engineers with laptops and USB sticks full of configs to get
> back to a working state again?
>
> As you follow the control of core networks back up the chain,
> you ultimately realize that no network is truly redundant and
> diverse. Every network ultimately comes back to a single point
> of failure; the only distinction you can make is how far up the
> ladder you climb before you find it.
>
> Thanks!
>
> Matt
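To make Matt's shared-fate point above concrete, here's a hypothetical toy model (the vendor names, route counts, and IGP limit are all invented for illustration): two vendor-diverse data planes fail together because the errant BGP-into-IGP redistribution happens in the single control plane they share.

# Hypothetical sketch (invented names and limits): hardware diversity
# doesn't help when both data planes sit behind one control plane.

IGP_SAFE_LIMIT = 10_000    # routes the IGP can hold comfortably (assumed)
FULL_BGP_TABLE = 900_000   # rough size of a full IPv4 BGP table

class DataPlane:
    def __init__(self, vendor):
        self.vendor = vendor
        self.up = True

planes = [DataPlane("vendor_a"), DataPlane("vendor_b")]

def errant_redistribution(planes, route_count):
    # The mistake happens in the shared control plane, so it hits every
    # data plane at once; vendor diversity doesn't partition this fault.
    if route_count > IGP_SAFE_LIMIT:
        for plane in planes:
            plane.up = False  # both vendors' gear falls over together

errant_redistribution(planes, FULL_BGP_TABLE)
print([(p.vendor, p.up) for p in planes])  # both down despite dual vendors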