> On Jun 24, 2019, at 8:03 PM, Tom Beecher <beec...@beecher.cc> wrote:
>
> Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do not work
> on 701. My comments are my own opinions only.
>
> Respectfully, I believe Cloudflare’s public comments today have been a real
> disservice. This blog post, and your CEO on Twitter today, took every
> opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.
I presume that seeing a CF blog post isn’t a regular thing for you. :-) Please
read on.
> You are 100% right that 701 should have had some sort of protection mechanism
> in place to prevent this. But do we know they didn’t? Do we know it was there
> and just set up wrong? Did another change at another time break what was
> there? I used 701 many jobs ago and they absolutely had filtering in place;
> it saved my bacon when I screwed up once and started readvertising a full
> table from a 2nd provider. They smacked my session down and I got a nice call
> about it.
>
> You guys have repeatedly accused them of being dumb without even speaking to
> anyone yet from the sounds of it. Shouldn’t we be working on facts?
>
> Should they have been easier to reach once an issue was detected? Probably.
> They’re certainly not the first vendor to have a slow response time though.
> Seems like when an APAC carrier takes 18 hours to get back to us, we write it
> off as the cost of doing business.
>
> It also would have been nice, in my opinion, to take a harder stance on the
> BGP optimizer that generated the bogus routes, and the steel company that
> failed BGP 101 and just gladly reannounced one upstream to another. 701 is
> culpable for their mistakes, but there doesn’t seem to be much
> appetite to shame the other contributors.
>
> You’re right to use this as a lever to push for proper filtering, RPKI, best
> practices. I’m 100% behind that. We can all be a hell of a lot better at what
> we do. This stuff happens more than it should, but less than it could.
>
> But this industry is one big ass glass house. What’s that thing about stones
> again?
I’m careful not to talk about the people impacted. There were a lot of them:
roughly 3-4% of the IP space was affected today, and I personally heard from
more providers than I can count on one hand about their impact.
Not everyone is going to write about their business impact in public. I’m not
authorized to speak for my employer about any impacts we may have had (for
example), but if 3-4% of the IP space was affected, statistically speaking
there’s always a chance someone was impacted.
I do agree about the glass house thing. There’s a lot of blame to go around,
and today I’ve been quoting “go read _Normal Accidents_” to people. That’s
because sufficiently complex systems tend to have complex failures in which
numerous safety systems or controls were bypassed. Those of us with more than
a few days of experience likely know what some of those controls are, but we
also don’t know whether they were disabled as part of debugging by one or more
parties. Who hasn’t dropped an ACL to figure out why something isn’t working,
or to see whether that fixed the problem?
I don’t know what happened, but I sure know the symptoms and sets of fixes that
the industry should apply and enforce. I have been communicating some of them
in public and many of them in private today, including offering help to other
operators with how to implement some of the fixes.
It’s a bad day when someone changes your /16 into two /17’s and sends them
out, regardless of whether the packets actually flow through. These things
aren’t new, nor do I expect them to be significantly better tomorrow. I know
people at VZ and suspect that once they woke up they did something about it.
I also know how hard it is to contact someone you don’t have a business
relationship with. A number of the larger providers have no way for a
non-customer to phone, message or open a ticket online about problems the
provider may have. Who knows, their ticket system may be in the cloud and have
been impacted as well.
What I do know is that if 3-4% of homes or structures were flooded or made
temporarily unusable by some form of disaster or evacuation, people would be
proposing better engineering methods or inspection techniques for those
structures.
If you are a small network and just point default, there is nothing for you
to see here and nothing you can do. If you speak BGP with your upstream, you
can filter out some of the bad routes. You perhaps know that 1239, 3356 and
others should only be seen directly from a network like 701, and you can apply
filters of that sort to keep from accepting those more-specifics. I don’t
believe 174 was the only network the routes went to, but they were one of the
networks, aside from 701, where I saw paths today.
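As a rough illustration of that kind of AS_PATH sanity check (sometimes
called “peerlock”), here is a small sketch in Python. It assumes you have
already pulled the AS paths of your received routes off the box by whatever
means you prefer; the set of “big” ASNs and the example paths are purely
illustrative, so adjust them to your own view of who should never appear
behind a small network.

--- path_check.py (illustrative sketch only) ---
#!/usr/bin/env python3
# Peerlock-style sanity check: a large transit network (1239, 3356, ...)
# should never show up *behind* a non-tier-1 ASN in a path learned from
# an upstream like 701. The ASN set here is illustrative, not canonical.

TIER1_ASNS = {174, 701, 1239, 2914, 3257, 3356, 6453, 6762}

def path_looks_leaked(as_path):
    """Return True if a tier-1 ASN appears after a non-tier-1 ASN,
    i.e. a big network is apparently being transited by someone who
    should not be providing transit to it."""
    seen_non_tier1 = False
    for asn in as_path:
        if asn in TIER1_ASNS:
            if seen_non_tier1:
                return True
        else:
            seen_non_tier1 = True
    return False

if __name__ == "__main__":
    # ASNs 64500-64502 are documentation ASNs standing in for small networks.
    print(path_looks_leaked([701, 64500, 64501, 3356, 64502]))  # True: suspicious
    print(path_looks_leaked([701, 3356, 64502]))                # False: normal
--- end ---

On a real router you would express the same idea as an AS_PATH regexp or
routing policy, but the logic is the same.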
(Now the part where you as a 3rd party to this event can help!)
If you peer, build some pre-flight and post-flight scripts to check how many
routes you are sending. Most router vendors support on-box scripting, or you
can do a “show | display xml”, JSON or some other structured output format you
can automate against. AS_PATH filters are simple, low cost and can help
mitigate problems. Consider monitoring your routes with a BMP server (pmacct
has a great one!). Set max-prefix limits (and monitor whether you are nearing
the thresholds!). Configure automatic session restarts if you won’t be around
to fix it.
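To make the pre-flight/post-flight idea concrete, here is a minimal off-box
sketch in Python. It assumes you have dumped the prefixes you advertise to a
peer into plain text files (one prefix per line) before and after your change,
by whatever means your platform supports; the file names and the threshold are
made up for illustration.

--- postflight.py (illustrative sketch only) ---
#!/usr/bin/env python3
"""Compare the prefixes advertised to a peer before and after a change and
complain if the count jumps. Input files hold one prefix per line, produced
however your platform allows; nothing here is vendor-specific."""

import sys

def load_prefixes(path):
    # One prefix per line; skip blanks and comment lines.
    with open(path) as fh:
        return {line.strip() for line in fh
                if line.strip() and not line.startswith("#")}

def main(pre_file, post_file, max_new=50):
    pre = load_prefixes(pre_file)
    post = load_prefixes(post_file)
    added = post - pre
    removed = pre - post
    print("before: %d  after: %d  added: %d  removed: %d"
          % (len(pre), len(post), len(added), len(removed)))
    if len(added) > max_new:
        print("WARNING: %d new prefixes advertised; that smells like a leak."
              % len(added))
        print("first few:", sorted(added)[:5])
        return 1
    return 0

if __name__ == "__main__":
    # usage: postflight.py pre.txt post.txt
    sys.exit(main(sys.argv[1], sys.argv[2]))
--- end ---

Run it before and after a maintenance and you at least find out about a leak
from your own tooling instead of from someone else’s blog post.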
I hate to say “automate all the things”, but at least start with monitoring
so you know when things go bad. Slack and other tools have great APIs, and you
can have alerts sent to your systems telling you about problems. Try hard to
automate your debugging. Monitor for announcements of your space. The new RIS
Live API lets you do this, and it’s super easy to spin something up.
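As an example of that last point, here is a small sketch that watches RIS
Live for announcements covering your own space, using the third-party
websocket-client package. The endpoint URL and message fields follow my
reading of RIPE’s RIS Live documentation, so double-check them against the
current docs; the prefix and client name are placeholders.

--- ris_watch.py (illustrative sketch only) ---
#!/usr/bin/env python3
"""Watch RIS Live for BGP announcements covering a prefix and print them.
Requires 'pip install websocket-client'. Field names follow the RIS Live
docs as I understand them; verify against the current API before relying
on this."""

import json
import websocket  # from the websocket-client package

MY_PREFIX = "192.0.2.0/24"  # placeholder: your own prefix goes here
RIS_URL = "wss://ris-live.ripe.net/v1/ws/?client=leak-watch-example"

def watch(prefix):
    ws = websocket.create_connection(RIS_URL)
    # Subscribe to UPDATE messages for our prefix and anything more specific.
    ws.send(json.dumps({
        "type": "ris_subscribe",
        "data": {"prefix": prefix, "moreSpecific": True, "type": "UPDATE"},
    }))
    while True:
        msg = json.loads(ws.recv())
        data = msg.get("data", {})
        for ann in data.get("announcements", []):
            # This is where you would push to Slack, a pager, or a ticket.
            print("peer AS%s announced %s via path %s"
                  % (data.get("peer_asn"), ann.get("prefixes"), data.get("path")))

if __name__ == "__main__":
    watch(MY_PREFIX)
--- end ---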
Hold your suppliers accountable as well. If you are a customer of a network
that was impacted or accepted these routes, ask for a formal RFO and what the
corrective actions are. Don’t let them off the hook as it will happen again.
If you are using route optimization technology, make double certain it cannot
leak routes. Cisco IOS and Noction are two products that I either know, or
have been told, don’t ship with safe settings enabled by default. I learned
early on in the 90s the perils of having “everything on, unprotected” by
default. There were great bugs in software that allowed devices to be
compromised at scale, which created cleanup problems comparable to what we’ve
seen in recent years with IoT and other technologies. Tell your vendors you
want them to be secure by default, and vote with your personal and corporate
wallet when you can.
It won’t always work; some vendors will not be able or willing to clean up
their acts. But unless we act together as an industry to clean up the glass
inside our own homes, expect someone from the outside to come along at some
point who can force it. It may not even make sense (ask anyone who deals with
security audit checklists), but you will be required to do it.
Please take action within your power at your company. Stand up for what is
right for everyone facing this shared risk and threat. You may not enjoy who
the messenger is (or who is the loudest), but set that aside for the good of
the industry.
</soapbox>
- Jared
PS. We often call ourselves network engineers or architects. If we are truly
that, then we are using those industry standards as building blocks to ensure
a solid foundation. Make sure your foundation is stable. Learn from others’
mistakes to design and operate the best network feasible.