In this specific event, 3356 not withdrawing routes is certainly a head-scratcher, and I'm sure for many of us it's the thing we're most looking forward to a definitive answer on.

However, if a network only has 3356 as their upstream, they are 100% at the mercy of 3356 at all times. Having a redundant AND diverse connection to a 2nd upstream ASN at least provides you some options.

In this case, for example, let's say that at all times you did a +2 prepend to both 3356 and Acme. The 3356 event happens, and you shut down your session to them. Some percentage of your traffic that would have been faceplanting in/through 3356 now works via Acme. Then you notice the non-withdrawal issue. You can then remove 1 prepend, or perhaps deagg strategically, to try to get more traffic away from the trouble.

A redundant path to a different upstream at least gives you some potential options to work around the problem that you otherwise would not have. It wouldn't be perfect, but options > no options.

On Mon, Aug 31, 2020 at 5:08 PM Warren Kumari <war...@kumari.net> wrote:

> On Mon, Aug 31, 2020 at 4:36 PM Tom Beecher <beec...@beecher.cc> wrote:
> >
> > Hopefully those customers learned the difference between redundancy and diversity this weekend. :)
>
> I'm unclear how either solves things for many customers...
>
> If they had CenturyLink and AcmeNetworkWidgets, and announce the same network through both -- and their connection to CL went down, *but CL continues to announce / doesn't withdraw* they are still stuck, yes? (Unless they can deaggregate that is...)
> What am I missing?
>
> W
>
> >
> > On Mon, Aug 31, 2020 at 3:48 PM Eric Kuhnke <eric.kuh...@gmail.com> wrote:
> >>
> >> There's a number of enterprise end user type customers of 3356 that have on-premises server rooms/hosting for their stuff. And they spend a lot of money every month for a 'redundant' metro ethernet circuit that takes diverse fiber paths from their business park office building to the local clink/level3 POP. But all that last mile redundancy and fail over ability doesn't do much for them when 3356 breaks its network at the BGP level.
> >>
> >> On Mon, Aug 31, 2020 at 9:36 AM Drew Weaver <drew.wea...@thenap.com> wrote:
> >>>
> >>> I also found the part where they mention that a lot of hosting companies only have one uplink to be quizzical and also the fact that he goes pretty close to implying that its Centurylink’s customers fault for not having multiple paths to Cloudflare that don’t touch Centurylink a bit puzzling. It could have just been poorly written.
> >>>
> >>> From: NANOG <nanog-bounces+drew.weaver=thenap....@nanog.org> On Behalf Of Tom Beecher
> >>> Sent: Monday, August 31, 2020 9:26 AM
> >>> To: Hank Nussbacher <h...@interall.co.il>
> >>> Cc: NANOG <nanog@nanog.org>
> >>> Subject: Re: Centurylink having a bad morning?
> >>>
> >>> https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/
> >>>
> >>> I definitely found Mr. Prince's writing about yesterday's events fascinating.
> >>>
> >>> Verizon makes a mistake with BGP filters that allows a secondary mistake from leaked "optimizer" routes to propagate, and Mr. Prince takes every opportunity to lob large chunks of granite about how terrible they are.
> >>>
> >>> L3 allows an erroneous flowspec announcement to cause massive global connectivity issues, and Mr. Prince shrugs and says "Incidents happen."
> >>>
> >>> On Mon, Aug 31, 2020 at 1:15 AM Hank Nussbacher <h...@interall.co.il> wrote:
> >>>
> >>> On 30/08/2020 20:08, Baldur Norddahl wrote:
> >>>
> >>> https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/
> >>>
> >>> Sounds like Flowspec possibly blocking tcp/179 might be the cause.
> >>>
> >>> But that is Cloudflare speculation.
> >>>
> >>> Regards,
> >>> Hank
> >>>
> >>> Caveat: The views expressed above are solely my own and do not express the views or opinions of my employer
> >>>
> >>> An outage is what it is. I am not worried about outages. We have multiple transits to deal with that.
> >>>
> >>> It is the keep announcing prefixes after withdrawal from peers and customers that is the huge problem here. That is killing all the effort and money I put into having redundancy. It is sabotage of my network after I cut the ties. I do not want to be a customer at an outlet who has a system that will do that. Luckily we do not currently have a contract and now they will have to convince me it is safe for me to make a contract with them. If that is impossible I guess I won't be getting a contract with them.
> >>>
> >>> But I disagree in that it would be impossible. They need to make a good report telling exactly what went wrong and how they changed the design, so something like this can not happen again. The basic design of BGP is such that this should not happen easily if at all. They did something unwise. Did they make a route reflector based on a database or something?
> >>>
> >>> Regards,
> >>>
> >>> Baldur
> >>>
> >>> On Sun, Aug 30, 2020 at 5:13 PM Mike Bolitho <mikeboli...@gmail.com> wrote:
> >>>
> >>> Exactly. And asking that they somehow prove this won't happen again is impossible.
> >>>
> >>> - Mike Bolitho
> >>>
> >>> On Sun, Aug 30, 2020, 8:10 AM Drew Weaver <drew.wea...@thenap.com> wrote:
> >>>
> >>> I’m not defending them but I am sure it isn’t intentional.
> >>>
> >>> From: NANOG <nanog-bounces+drew.weaver=thenap....@nanog.org> On Behalf Of Baldur Norddahl
> >>> Sent: Sunday, August 30, 2020 9:28 AM
> >>> To: nanog@nanog.org
> >>> Subject: Re: Centurylink having a bad morning?
> >>>
> >>> How is that acceptable behaviour?
> >>> I shall remember never to make a contract with these guys until they can prove that they won't advertise my prefixes after I pull them. Under any circumstances.
> >>>
> >>> On Sun, Aug 30, 2020 at 15:14, Joseph Jenkins <j...@breathe-underwater.com> wrote:
> >>>
> >>> Finally got through on their support line and spoke to level1. The only thing the tech could say was it was an issue with BGP route reflectors and it started about 3am(pacific). They were still trying to isolate the issue. I've tried failing over my circuits and no go, the traffic just dies as L3 won't stop advertising my routes.
> >>>
> >>> On Sun, Aug 30, 2020 at 5:21 AM Drew Weaver via NANOG <nanog@nanog.org> wrote:
> >>>
> >>> Hello,
> >>>
> >>> Woke up this morning to a bunch of reports of issues with connectivity had to shut down some Level3/CTL connections to get it to return to normal.
> >>>
> >>> As of right now their support portal won’t load: https://www.centurylink.com/business/login/
> >>>
> >>> Just wondering what others are seeing.
> >>>
> >
>
> --
> I don't think the execution is relevant when it was obviously a bad idea in the first place.
> This is like putting rabid weasels in your pants, and later expressing regret at having chosen those particular rabid weasels and that pair of pants.
> ---maf
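
For anyone who wants to see the prepend and deaggregation options from the top of this message in concrete terms, here is a rough Python sketch. It is a toy model, not real BGP: it assumes a remote network that hears the prefix directly from both upstreams and chooses by longest matching prefix, then shortest AS path, ignoring local preference, MED, and everything else. The customer ASN (64496), the Acme ASN (64500), and the 198.18.0.0/23 address block are all made up for illustration, and "remove 1 prepend" is assumed to mean on the Acme side.

```python
from ipaddress import ip_address, ip_network
from typing import NamedTuple, Optional, Sequence, Tuple

L3 = 3356         # the upstream that keeps announcing stale routes
ACME = 64500      # hypothetical second upstream (documentation ASN)
CUSTOMER = 64496  # hypothetical customer ASN (documentation ASN)


class Route(NamedTuple):
    prefix: str
    as_path: Tuple[int, ...]


def best_route(routes: Sequence[Route], destination: str) -> Optional[Route]:
    """Simplified selection: longest matching prefix, then shortest AS path.

    Ties are broken by list order here; real routers use further tie-breakers.
    """
    dest = ip_address(destination)
    candidates = [r for r in routes if dest in ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ip_network(r.prefix).prefixlen, len(r.as_path)),
    )


# Stale announcement that 3356 never withdrew: origin plus 2 prepends.
stale_3356 = Route("198.18.0.0/23", (L3, CUSTOMER, CUSTOMER, CUSTOMER))

# Acme with the original +2 prepend, and with one prepend removed.
acme_two_prepends = Route("198.18.0.0/23", (ACME, CUSTOMER, CUSTOMER, CUSTOMER))
acme_one_prepend = Route("198.18.0.0/23", (ACME, CUSTOMER, CUSTOMER))

# Deaggregation: the same space announced as two /24s via Acme only.
specific_a = Route("198.18.0.0/24", (ACME, CUSTOMER, CUSTOMER, CUSTOMER))
specific_b = Route("198.18.1.0/24", (ACME, CUSTOMER, CUSTOMER, CUSTOMER))

# Option 1: with one prepend removed, the Acme path is shorter than the stale one.
after_prepend_change = best_route([stale_3356, acme_one_prepend], "198.18.0.10")

# Option 2: the more-specific wins even though its AS path is no shorter.
after_deagg = best_route(
    [stale_3356, acme_two_prepends, specific_a, specific_b], "198.18.1.10"
)

print("after removing a prepend, exit via AS", after_prepend_change.as_path[0])
print("after deaggregating, chosen prefix is", after_deagg.prefix)
```

In this toy model the shorter Acme path only helps networks that actually compare the two paths on length, while the more-specifics pull traffic regardless of path length, provided the receiving networks accept prefixes that small.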