> On Jun 24, 2019, at 8:03 PM, Tom Beecher <beec...@beecher.cc> wrote:
>
> Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do not work
> on 701. My comments are my own opinions only.
>
> Respectfully, I believe Cloudflare’s public comments today have been a real
> disservice. This blog post, and your CEO on Twitter today, took every
> opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.
I presume that seeing a CF blog post isn’t a regular thing for you. :-) Please
read on.
> You are 100% right that 701 should have had some sort of protection mechanism
> in place to prevent this. But do we know they didn’t? Do we know it was there
> and just set up wrong? Did another change at another time break what was
> there? I used 701 many jobs ago and they absolutely had filtering in place;
> it saved my bacon when I screwed up once and started readvertising a full
> table from a 2nd provider. They smacked my session down and I got a nice call
> about it.
>
> You guys have repeatedly accused them of being dumb without even speaking to
> anyone yet from the sounds of it. Shouldn’t we be working on facts?
>
> Should they have been easier to reach once an issue was detected? Probably.
> They’re certainly not the first vendor to have a slow response time though.
> Seems like when an APAC carrier takes 18 hours to get back to us, we write it
> off as the cost of doing business.
>
> It also would have been nice, in my opinion, to take a harder stance on the
> BGP optimizer that generated the bogus routes, and the steel company that
> failed BGP 101 and just gladly reannounced one upstream to another. 701 is
> culpable for their mistakes, but there doesn’t seem to be much
> appetite to shame the other contributors.
>
> You’re right to use this as a lever to push for proper filtering, RPKI, best
> practices. I’m 100% behind that. We can all be a hell of a lot better at what
> we do. This stuff happens more than it should, but less than it could.
>
> But this industry is one big ass glass house. What’s that thing about stones
> again?
I’m careful not to talk about the people impacted. There were a lot of them:
roughly 3-4% of the IP space was affected today, and I personally heard from
more providers than I can count on one hand about their impact.
Not everyone is going to write about their business impact in public. I’m not
authorized to speak for my employer about any impacts we may have had (for
example), but if 3-4% of the IP space was affected, statistically speaking
there’s always a chance someone was impacted.
I do agree about the glass house thing. There’s a lot of blame to go around,
and today I’ve been quoting “go read _Normal Accidents_” to people. That’s
because sufficiently complex systems tend to have complex failures in which
numerous safety systems or controls were bypassed. Those of us with more than
a few days of experience likely know what some of those controls are, but we
also don’t know whether they were disabled as part of debugging by one or more
parties. Who hasn’t dropped an ACL to figure out why something isn’t working,
or to see whether that fixed the problem?
I don’t know what happened, but I sure know the symptoms and sets of fixes that
the industry should apply and enforce. I have been communicating some of them
in public and many of them in private today, including offering help to other
operators with how to implement some of the fixes.
It’s a bad day when someone changes your /16 into two /17’s and sends them
out, regardless of whether the packets actually flow through. These things
aren’t new, nor do I expect them to be significantly better tomorrow. I know
people at VZ and suspect that once they woke up they did something about it.
I also know how hard it is to contact someone you don’t have a business
relationship with. A number of the larger providers have no way for a
non-customer to phone, message or open a ticket online about problems the
provider may have. Who knows, their ticket system may be in the cloud and have
been impacted as well.
What I do know is that if 3-4% of homes or structures were flooded or made
temporarily unusable by some form of disaster or evacuation, people would be
proposing better engineering methods or inspection techniques for those
structures.
If you are a small network and just point default, there is nothing for you
to see here and nothing you can do. If you speak BGP with your upstream, you
can filter out some of the bad routes. You perhaps know that 1239, 3356 and
others should only be seen directly from a network like 701, and you can apply
filters of that sort to keep from accepting those more-specifics. I don’t
believe 174 was the only network the routes went to, but they were one of the
networks, aside from 701, where I saw paths today.
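As a rough illustration of that kind of AS_PATH sanity check (sometimes
called “peerlock”), here is a small sketch in Python. It assumes you have
already pulled the AS paths of your received routes off the box by whatever
means you prefer; the set of “big” ASNs and the example paths are purely
illustrative, so adjust them to your own view of who should never appear
behind a small network.

--- path_check.py (illustrative sketch only) ---
#!/usr/bin/env python3
# Peerlock-style sanity check: a large transit network (1239, 3356, ...)
# should never show up *behind* a non-tier-1 ASN in a path learned from
# an upstream like 701. The ASN set here is illustrative, not canonical.

TIER1_ASNS = {174, 701, 1239, 2914, 3257, 3356, 6453, 6762}

def path_looks_leaked(as_path):
    """Return True if a tier-1 ASN appears after a non-tier-1 ASN,
    i.e. a big network is apparently being transited by someone who
    should not be providing transit to it."""
    seen_non_tier1 = False
    for asn in as_path:
        if asn in TIER1_ASNS:
            if seen_non_tier1:
                return True
        else:
            seen_non_tier1 = True
    return False

if __name__ == "__main__":
    # ASNs 64500-64502 are documentation ASNs standing in for small networks.
    print(path_looks_leaked([701, 64500, 64501, 3356, 64502]))  # True: suspicious
    print(path_looks_leaked([701, 3356, 64502]))                # False: normal
--- end ---

On a real router you would express the same idea as an AS_PATH regexp or
routing policy, but the logic is the same.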
(Now the part where you as a 3rd party to this event can help!)
If you peer, build some pre-flight and post-flight scripts to check how many
routes you are sending. Most router vendors support on-box scripting, or you
can do a “show | display xml”, JSON or some other structured output format you
can automate against. AS_PATH filters are simple, low cost and can help
mitigate problems. Consider monitoring your routes with a BMP server (pmacct
has a great one!). Set max-prefix limits (and monitor whether you are nearing
the thresholds!). Configure automatic session restarts if you won’t be around
to fix it.
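To make the pre-flight/post-flight idea concrete, here is a minimal off-box
sketch in Python. It assumes you have dumped the prefixes you advertise to a
peer into plain text files (one prefix per line) before and after your change,
by whatever means your platform supports; the file names and the threshold are
made up for illustration.

--- postflight.py (illustrative sketch only) ---
#!/usr/bin/env python3
"""Compare the prefixes advertised to a peer before and after a change and
complain if the count jumps. Input files hold one prefix per line, produced
however your platform allows; nothing here is vendor-specific."""

import sys

def load_prefixes(path):
    # One prefix per line; skip blanks and comment lines.
    with open(path) as fh:
        return {line.strip() for line in fh
                if line.strip() and not line.startswith("#")}

def main(pre_file, post_file, max_new=50):
    pre = load_prefixes(pre_file)
    post = load_prefixes(post_file)
    added = post - pre
    removed = pre - post
    print("before: %d  after: %d  added: %d  removed: %d"
          % (len(pre), len(post), len(added), len(removed)))
    if len(added) > max_new:
        print("WARNING: %d new prefixes advertised; that smells like a leak."
              % len(added))
        print("first few:", sorted(added)[:5])
        return 1
    return 0

if __name__ == "__main__":
    # usage: postflight.py pre.txt post.txt
    sys.exit(main(sys.argv[1], sys.argv[2]))
--- end ---

Run it before and after a maintenance and you at least find out about a leak
from your own tooling instead of from someone else’s blog post.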
I hate to say “automate all the things”, but at least start with monitoring
so you know when things go bad. Slack and other tools have great APIs, and you
can have alerts sent to your systems telling you about problems. Try hard to
automate your debugging. Monitor for announcements of your space. The new RIS
Live API lets you do this, and it’s super easy to spin something up.
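As an example of that last point, here is a small sketch that watches RIS
Live for announcements covering your own space, using the third-party
websocket-client package. The endpoint URL and message fields follow my
reading of RIPE’s RIS Live documentation, so double-check them against the
current docs; the prefix and client name are placeholders.

--- ris_watch.py (illustrative sketch only) ---
#!/usr/bin/env python3
"""Watch RIS Live for BGP announcements covering a prefix and print them.
Requires 'pip install websocket-client'. Field names follow the RIS Live
docs as I understand them; verify against the current API before relying
on this."""

import json
import websocket  # from the websocket-client package

MY_PREFIX = "192.0.2.0/24"  # placeholder: your own prefix goes here
RIS_URL = "wss://ris-live.ripe.net/v1/ws/?client=leak-watch-example"

def watch(prefix):
    ws = websocket.create_connection(RIS_URL)
    # Subscribe to UPDATE messages for our prefix and anything more specific.
    ws.send(json.dumps({
        "type": "ris_subscribe",
        "data": {"prefix": prefix, "moreSpecific": True, "type": "UPDATE"},
    }))
    while True:
        msg = json.loads(ws.recv())
        data = msg.get("data", {})
        for ann in data.get("announcements", []):
            # This is where you would push to Slack, a pager, or a ticket.
            print("peer AS%s announced %s via path %s"
                  % (data.get("peer_asn"), ann.get("prefixes"), data.get("path")))

if __name__ == "__main__":
    watch(MY_PREFIX)
--- end ---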
Hold your suppliers accountable as well. If you are a customer of a network
that was impacted or accepted these routes, ask for a formal RFO and what the
corrective actions are. Don’t let them off the hook as it will happen again.
If you are using route optimization technology, make double certain it cannot
leak routes. Cisco IOS and Noction are two products that I either know, or
have been told, don’t ship with safe settings enabled by default. I learned
early on in the 90s the perils of having “everything on, unprotected” by
default. There were great bugs in software that allowed devices to be
compromised at scale, which created cleanup problems comparable to what we’ve
seen in recent years with IoT and other technologies. Tell your vendors you
want them to be secure by default, and vote with your personal and corporate
wallet when you can.
It won’t always work; some vendors will not be able or willing to clean up
their acts. But unless we act together as an industry to clean up the glass
inside our own homes, expect someone from the outside to come along at some
point who can force it. It may not even make sense (ask anyone who deals with
security audit checklists), but you will be required to do it.
Please take action within your power at your company. Stand up for what is
right for everyone facing this shared risk and threat. You may not enjoy who
the messenger is (or who is the loudest), but set that aside for the good of
the industry.
</soapbox>
- Jared
PS. We often call ourselves network engineers or architects. If we are truly
that, then we are using those industry standards as building blocks to ensure
a solid foundation. Make sure your foundation is stable. Learn from others’
mistakes to design and operate the best network feasible.