Hello NANOG -
On Saturday, October 19th at about 13:00 UTC we experienced an IP failure
at one of our sites in the New York area.
It was apparently a widespread outage on the East Coast, but I haven't seen
it discussed here.
We are multihomed, using EBGP to three (diverse) upstream providers. One
provider experienced a hardware failure in a core component at one POP.
Regrettably, during the outage our BGP session remained active and we
continued receiving full routes from the affected AS. And our prefixes
continued to be advertised at their border. However, essentially none of the
traffic to or from those prefixes that crossed that provider was delivered.
The bogus routes stayed up for hours. We shut down the BGP peering session when the
nature of the problem became clear. This was effective. I believe that all
customer BGP routes were similarly affected, including those belonging to
some large regional networks and corporations. I have raised the questions
below with the provider but haven't received any information or advice.
My question is why did our BGP configuration fail? I'm guessing the basic
answer is that the IGP and route reflectors within that provider were still
connected, but the forwarding paths were unavailable. My BGP session
basically acted like a bunch of static routes, with no awareness of the
failure(s) and no dynamic reconfiguration of the RIB.
Is this just an unavoidable issue with scaling large networks?
Is it perhaps a known side effect of MPLS?
Have we/they lost something important in the changeover to converged
multiprotocol networks?
Is there a better way for us edge networks to achieve IP resiliency in the
current environment?
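For that last question, the only stopgap I've come up with so far is to stop
trusting session state alone and actively probe the data plane through each
upstream, then alarm (and if necessary shut the session) when the probes die
while BGP stays up. A rough sketch of the kind of check I mean is below; the
beacon addresses, the failure threshold, and the Linux-style ping flags are
placeholders for illustration, not a finished tool.

import os
import subprocess

# Ping a few "beacon" destinations whose best path should be via the upstream
# under test, and warn when most of them stop answering even though the BGP
# session is still established. Addresses and threshold are placeholders.
BEACONS = ["192.0.2.1", "198.51.100.1", "203.0.113.1"]
FAIL_THRESHOLD = 2

def ping(addr, count=3, timeout=2):
    # One small ICMP probe batch; True if at least one reply came back.
    with open(os.devnull, "w") as devnull:
        rc = subprocess.call(
            ["ping", "-c", str(count), "-W", str(timeout), addr],
            stdout=devnull, stderr=devnull)
    return rc == 0

def check_upstream(name):
    failed = [b for b in BEACONS if not ping(b)]
    if len(failed) >= FAIL_THRESHOLD:
        print("WARNING: data plane via %s looks dead (%s unreachable); "
              "consider shutting that BGP session." % (name, ", ".join(failed)))
    else:
        print("%s: data plane looks OK (%d/%d beacons answered)."
              % (name, len(BEACONS) - len(failed), len(BEACONS)))

check_upstream("upstream-A")

The hard part is picking beacons whose forwarding path really does go through
the upstream you want to test (or forcing the probes out that way with a
separate source address or routing instance); the script itself is the easy bit.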
This is an operational issue. Thanks in advance for any hints about what
happened or better practices to reduce the impact of a routine hardware
fault in an upstream network.
- Eric Jensen
Date: Wed, 23 Oct 2013 21:26:43 -0400
To: c...@chrisjensen.org
From: JRC NetOps <n...@jensenresearch.com>
Subject: Fwd: BGP failure analysis and recommendations
Date: Mon, 21 Oct 2013 23:19:28 -0400
To: christopher.sm...@level3.com
From: Eric Jensen <ejen...@jensenresearch.com>
Subject: BGP failure analysis and recommendations
Cc: "Joe Budelis Fast-E.com" <j...@fast-e.com>
Bcc: n...@jensenresearch.com
Hello Christopher Smith -
I left you a voicemail message today. The Customer Service folks also
gave me your email address.
We have a small but high-value multihomed corporate network.
We operate using our AS number 17103.
We have BGP transit circuits with Level 3, with Lightpath, and at our colo
center (AS8001).
The Level 3 circuit ID is BBPM9946.
On Saturday, October 19, 2013, we had a large IP outage. I tracked it back
to our Level 3 circuit and opened a ticket (7126634).
I have copied (below) an email I sent our channel salesman with more
details about our BGP problems during your outage.
Briefly, I am very concerned that Level 3 presented routes to us that
were not actually reachable through your network, and, even worse, that
Level 3 kept advertising our prefixes even though your network could not
deliver the traffic for those prefixes to us.
I believe that the BGP NLRI data should follow the same IP path as the
forwarded data itself. Apparently this isn't the case at Level 3.
I also believe that your MPLS backbone should have recovered
automatically from the forwarding failure, but this didn't happen either.
My only fix was to manually shut down the BGP peering session with Level 3.
Can you explain to me how Level 3 black-holed my routes?
Can you suggest some change to our or your BGP configuration to eliminate
this BGP failure mode?
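In the meantime, the only customer-side guard I can see is to verify the data
path independently of the BGP session, for example by tracerouting toward a
few destinations learned from you and flagging when the probes die inside your
network while the routes are still installed. A rough sketch of what I mean
(the destination addresses, hop limit, and Linux traceroute flags are
placeholders, not a finished tool):

import subprocess

# Destinations whose routes we learn from this provider; placeholders only.
TEST_DESTS = ["192.0.2.10", "198.51.100.20"]

def last_responding_hop(dest, max_hops=15):
    # Quick numeric traceroute; return the last hop address that answered.
    out = subprocess.check_output(
        ["traceroute", "-n", "-w", "2", "-q", "1", "-m", str(max_hops), dest])
    last = None
    for line in out.decode().splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2 and fields[1] != "*":
            last = fields[1]
    return last

for dest in TEST_DESTS:
    hop = last_responding_hop(dest)
    if hop == dest:
        print("%s: reachable" % dest)
    elif hop is None:
        print("%s: no hops answered at all" % dest)
    else:
        print("%s: probes die after %s while the route is still installed"
              % (dest, hop))

A check like that would at least have told me much sooner that the routes
being advertised no longer corresponded to a working path.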
Just to be clear, I don't expect our circuit, or your network, to be up
all the time. But I do expect that the routes you advertise to us and to
your BGP peers actually be reachable through your network. On Saturday
this didn't happen. The routes stayed up while the data transport was down.
Our IPv4 BGP peering session with Level 3 remains down in the interim.
Please get back to me as soon as possible.
- Eric Jensen
AS17103
201-741-9509
Date: Mon, 21 Oct 2013 22:55:35 -0400
To: "Joe Budelis Fast-E.com" <j...@fast-e.com>
From: Eric Jensen <ejen...@jensenresearch.com>
Subject: Re: Fwd: Level3 Interim Response
Bcc: n...@jensenresearch.com
Hi Joe-
Thanks for making the new inquiry.
This was a big outage. Apparently Time Warner Cable and Cablevision
were hit hard, along with many large corporate networks. And of course
all the single-homed Level 3 customers worldwide. My little network was
just one more casualty.
See:
http://www.dslreports.com/forum/r28749556-Internet-Level3-Outage-
http://online.wsj.com/news/articles/SB10001424052702304864504579145813698584246
For our site, the massive outage started at about 9:00 am Saturday and
lasted until after 2:00 pm. I opened a ticket at about 9:30 am, but only
recognized the routing problem and took down our BGP session at about
12:00, to limit the problems caused for our traffic by their bogus BGP
advertisements.
There can always be equipment failures and fiber cuts. That's not the
problem.
From my point of view, the problem was (and is) that Level 3 kept
"advertising" our prefixes but couldn't deliver the packets to us. They
did this for all their customers' prefixes, thereby sucking in roughly
half of the NYC-area Internet traffic and dumping it into the Hudson
River for hours.
They also kept advertising all their BGP routes to me, thereby fooling
my routers into sending outbound traffic to Level 3, where it was again
dumped into the Hudson.
I called Level 3 customer service today and have the name of a network
engineer to discuss options for fixing the BGP failure.
If you get any response with an engineering contact please let me know.
I shouldn't have to manually intervene to route around problems. Even
sadder is the response from Level 3 explaining that they spent hours
trying to find the problem and had to manually reconfigure their
network, leading to saturated links and more problems. Their network
only healed when the faulty line card was replaced.
I reactivated the BGP session later that night, but after reviewing the
actual damage we incurred and the widespread nature of the failure, I
have decided to leave our Level 3 BGP session down, at least until the
engineering situation improves.
There may not be any good way to use a Level 3 BGP session without
risking the same "black hole" problem going forward. This is exactly the
type of failure BGP was designed to route around, but BGP was developed
in the days of point-to-point circuits carrying IP traffic. Nowadays some
networks insert a new layer between the wires and IP, namely MPLS, and in
this case it let BGP stay up while depriving the routers of functioning
IP next-hops, a condition that neither the Level 3 routers nor the Level 3
personnel detected. Apparently the Level 3 IP-based BGP routers all
believed they had working circuits edge to edge, when in fact their
network was partitioned.
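The logic, written out as a toy model (pure illustration, not anyone's actual
implementation), is roughly this: route selection only asks whether the
session is up and the next-hop still resolves in the control plane; it never
asks whether packets actually cross the core, so a broken label-switched path
changes nothing in the decision.

# Toy model of why the routes stayed installed while forwarding was dead.

# What the control plane still believed on Saturday:
control_plane = {
    "session_up": True,          # the BGP sessions never dropped
    "next_hop_resolves": True,   # the IGP still had a path to each next-hop
}

# What the forwarding plane was actually doing:
data_plane = {
    "lsp_delivers_traffic": False,  # the path through the failed card was dead
}

def route_stays_installed(cp):
    # The only tests a classic BGP implementation performs.
    return cp["session_up"] and cp["next_hop_resolves"]

def packets_delivered(cp, dp):
    return route_stays_installed(cp) and dp["lsp_delivers_traffic"]

print(route_stays_installed(control_plane))           # True: still advertised
print(packets_delivered(control_plane, data_plane))   # False: black hole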
MPLS must have some redundancy features, but they obviously weren't
working on Saturday. This is a huge engineering failure. No large ISP
could function this way for long.
I can wait the 72 hours for their response. I expect it will be full of
mealy-mouthed platitudes about how no system is foolproof and it will all
be fine now.
It would be more interesting to me to be in the meeting room where some
engineer has to explain how they could lose so much traffic and not be
able to operate a functioning, if degraded, network after a single line
card failure. It wouldn't be the head of network design, because that
person would already have been fired.
Let me know if you hear anything. I will do the same.
- Eric Jensen
AS17103
201-741-9509