I'm not buying this explanation, for a number of reasons:

1.  Are you telling me that several line cards failed in multiple cities in the 
same way at the same time?  Don't think so unless the same software fault was 
propagated to all of them.  If the problem was that they needed to be reset, 
couldn't that be accomplished by simply reseating them?

2.  Do we believe that an OOB management card was able to generate so much 
traffic as to bring down the optical switching?  Very doubtful, which means 
that the systems were actually broken by trying to PROCESS the "invalid 
frames".  That seems like very poor control plane management if the system 
attempts to process invalid data and brings down the forwarding plane as a 
result.

3.  In the cited document it was stated that the offending packet did not have 
source or destination information.  If so, how did it get propagated throughout 
the network?

My guess at the time and my current opinion (which has no real factual basis, 
just years of experience) is that a bad software package was propagated through 
their network.

Steven Naslund
Chicago IL

>
> One thing that is troubling when reading that URL is that it appears several 
> steps of restoration required teams to go onsite for local login, etc. 
> Granted, to troubleshoot hardware you need to be physically present to pop a 
> line card in and out, but CTL/LVL3 should have full out-of-band console and 
> power control to all core devices; we shouldn't be waiting for someone to 
> drive to a location to get console access or do power cycling. And I would 
> imagine the first step in a lot of the troubleshooting was power cycling and 
> local console logs.
>
>
> -John
>
>
>
> On 12/30/18 10:42 AM, Mike Hammett wrote:
>
> It's technical enough so that laypeople immediately lose interest, yet 
> completely useless to anyone that works with this stuff.
>
>
>
> -----
> Mike Hammett
> Intelligent Computing Solutions
> http://www.ics-il.com
>
> Midwest-IX
> http://www.midwest-ix.com
>
> ________________________________
> From: "Saku Ytti" <s...@ytti.fi>
> To: "nanog list" <nanog@nanog.org>
> Sent: Sunday, December 30, 2018 7:42:49 AM
> Subject: CenturyLink RCA?
>
> Apologies for the URL, I do not know the official source and I do not 
> share the URL's sentiment.
> https://fuckingcenturylink.com/
>
> Can someone translate this into IP-engineer terms? What actually happened?
> From my own history, I rarely recognise the problem I fixed from 
> reading the public RCA. I hope CenturyLink will do better.
>
> Best guess so far that I've heard is:
>
> a) CenturyLink runs a global L2 DCN/OOB
> b) there was a HW fault which caused an L2 loop (perhaps HW dropped BPDUs; 
> I've had this failure mode)
> c) the DCN had direct access to the control plane, and the L2 loop congested 
> control-plane resources, causing it to deprovision waves
>
> Now of course this is entirely speculation, but intended to show what 
> type of explanation is acceptable and can be used to fix things.
> Hopefully CenturyLink does come out with an explanation readable by IP 
> engineers, so that we may use it as leverage to support work in our 
> own domains to remove such risks.
>
> a) do not run L2 DCN/OOB
> b) do not connect MGMT ETH (it is unprotected access to the control plane; 
> it cannot be protected by CoPP/lo0 filter/LPTS etc.)
> c) do add a scoring item in your RFP for a proper OOB port (like Cisco 
> CMP)
> d) do fail optical network up
>
> --
>   ++ytti
>


