People from big telcos should never reply to mailing lists from work addresses unless specifically allowed, which I suspect TATA doesn't allow either, based on some direct, but old, knowledge :)
Filtering has been a community issue since my days @ MCI (AS3561): often discussed, not often enough acted on. I suspect the topic has come up at every "large" NSP I've worked at. Frequently someone complains it's "hard" to fix, or router X makes it hard to fix, or customer Y won't agree, and not enough people stand up to force a fix. I did a preso on it (while working at TATA) with some other "smart folks", but for all the usual reasons it died on the vine. I don't blame (3) for this, but rather our community as a whole. Many "people/networks" have to not do the "right thing(tm)" for a failure like this to happen.

-jim

On Fri, Jun 12, 2015 at 12:43 PM, Utkarsh Gosain <utkarsh.gos...@tatacommunications.com> wrote:
> Hi Martin
>
> I am not a spokesperson on behalf of L3, but I have worked for big telcos my whole career, and my recommendation is to raise a trouble ticket if anyone on the forum is their customer and is affected.
>
> I don't think NOC engineers at any of the major telcos are authorized to reply to forums, especially regarding outages, unless someone raises a trouble ticket and seeks an RCA of the issue one-on-one with them.
>
> Utkarsh Gosain
> Global Acc Director
> Tata Communications
>
> -----Original Message-----
> From: NANOG [mailto:nanog-boun...@nanog.org] On Behalf Of Martin Millnert
> Sent: Friday, June 12, 2015 11:33 AM
> To: NANOG
> Subject: Open letter to Level3 concerning the global routing issues on June 12th
>
> Dear Level3,
>
> The Internet is a cooperative effort, and it works well only when its participants take constructive actions to address errors and remedy problems.
>
> Your position as a major Internet carrier bestows upon you a certain degree of responsibility for the correct operation of the Internet all across (and beyond) the planet. You have many customers. Customers will always occasionally make mistakes.
> You, as a major Internet carrier, have a responsibility to limit, not amplify, your customers' mistakes.
>
> Other major carriers implement technical measures that severely limit the damage from customer mistakes and keep them from having global impact. Other major carriers also implement operational procedures in addition to technical measures. In combination, these measures drastically reduce the outage-hours resulting from customer configuration errors.
>
> At 08:44 UTC on Friday 12th of June, one of your transit customers, Telekom Malaysia (AS4788), began announcing the full Internet table back to you, which you accepted and propagated to your peers and customers, causing global outages for close to 3 hours.
> [ https://twitter.com/DynResearch/status/609340592036970496 ]
>
> During this 3-hour window, it appears (from your own service outage reports) that you did nothing to stop the global Internet outage, and that Telekom Malaysia themselves eventually resolved it. This lack of action on your end, and your disregard for the correct operation of the global Internet, is astonishing. These mistakes do not need to happen.
>
> AS4788 under normal circumstances announces ~1900 IPv4 prefixes to the Internet. You accepted several hundred thousand prefixes from them; a max-prefix setting would have severely limited the damage. We expect that these are your practices as well, but they failed. When they do fail, it should not take ~3 hours to shut down the session(s).
>
> Many operators, in despair, turned down their peering sessions with you once it was clear you were causing the outages and no immediate fix was in sight. This improved the situation for some, but not for all. Had you deployed proper IRR filtering to filter the bad announcements, the impact would have been far less critical.
>
> As a local example of the direct consequences of your ~3 hours of inaction, Swedish payment terminals experienced problems all over the country.
> The Swedish economy was directly affected by your inaction. There were queues when I was buying lunch! Imagine the food rage. The situation was probably similar at other places around the globe where people were awake.
>
> Operators around the planet are curious:
> - Did Level3 not detect or understand that it was causing global Internet outages for ~3 hours?
> - If Level3 did in fact detect or understand that it was causing global Internet outages, why did it not properly and immediately remedy the situation?
> - What is Level3 going to do to address these questions and begin work on restoring its credibility as a carrier?
>
> We all understand that mistakes do happen (in applying customer interface templates, etc.). However, the Internet is all too pervasive in everyday life today for anything but swift action by carriers to remedy breakage after the fact. It is absolutely not sufficient to let a customer spend 3 hours detecting and fixing a situation like this one. It is unacceptable that no swift action was taken on your end to limit the global routing issues you caused.
>
> Sincerely,
> Martin Millnert
> Member of Internet Community - no carrier / ISP affiliation.