On Mon, Jun 24, 2019 at 08:03:26PM -0400, Tom Beecher wrote:
> 
> You are 100% right that 701 should have had some sort of protection
> mechanism in place to prevent this. But do we know they didn't? Do we know
> it was there and just set up wrong? Did another change at another time break
> what was there? I used 701 many jobs ago and they absolutely had filtering
> in place; it saved my bacon when I screwed up once and started
> readvertising a full table from a 2nd provider. They smacked my session
> down and I got a nice call about it.

In my past (and current) dealings with AS701, I agree that they have
generally been good about filtering customer sessions and running a
tight ship.  But, manual config changes being what they are, I suppose
an honest mistake or oversight occurred at 701 today that made them
contribute significantly to the outage.


> 
> It also would have been nice, in my opinion, to take a harder stance on the
> BGP optimizer that generated the bogus routes, and the steel company that
> failed BGP 101 and just gladly reannounced one upstream to another. 701 is
> culpable for their mistakes, but there doesn't seem to be much
> appetite to shame the other contributors.

I think the biggest question to be asked here is: why the hell is a BGP
optimizer (Noction in this case) injecting fake more-specifics to steer
traffic?  And why did a regional provider selling IP transit (DQE) use
such a dangerous accident-waiting-to-happen tool in their network,
especially when they have other ASNs taking transit feeds from them,
with all these fake man-in-the-middle routes being injected?

I get that BGP optimizers can have some use cases, but IMO, in most
situations (especially if you are a network provider selling transit and
taking peering yourself), a well-crafted routing policy and
interconnection strategy eliminates the need for implementing flawed
route selection optimizers in your network.

The notion of a BGP optimizer generating fake more-specifics is absurd,
and it is definitely not a tool that is designed to "fail safe".  Instead
of failing safe, it has failed epically and catastrophically today.  I
remember a long time ago, when Internap used to sell their FCP product,
Internap SEs were advising the customer to make appropriate adjustments
to local-preference to prefer the FCP-generated routes to ensure optimal
selection.  That is a much saner design choice than injecting
man-in-the-middle attacks and relying on customers to prevent a
disaster.
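
To make the difference concrete, here is a rough toy model in Python.  It
is only a sketch: the best_route helper, the "via" labels, and the
documentation prefixes are all made up for illustration, and it ignores
everything in real BGP best-path selection except prefix length and
local-preference.  The point is that once a more-specific escapes, no
local-preference setting on the covering prefix can win against it.

```python
import ipaddress

def best_route(rib, dst):
    """Toy model: longest prefix match first, then highest local-pref
    among routes of that same prefix length. Not a real BGP decision process."""
    addr = ipaddress.ip_address(dst)
    candidates = [r for r in rib if addr in ipaddress.ip_network(r["prefix"])]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda r: (ipaddress.ip_network(r["prefix"]).prefixlen, r["local_pref"]),
    )

# Hypothetical RIB: a legitimate /24 with high local-pref, plus a leaked
# optimizer-generated /25 with low local-pref.
rib = [
    {"prefix": "203.0.113.0/24", "local_pref": 200, "via": "legitimate-origin"},
    {"prefix": "203.0.113.0/25", "local_pref": 50, "via": "leaked-optimizer-route"},
]

# The /25 wins for addresses it covers, despite its lower local-pref.
print(best_route(rib, "203.0.113.10")["via"])
```

A local-preference tweak on an existing prefix, by contrast, only changes
the choice among equal-length routes inside your own AS, so a leak of it
does not outrank the real announcement elsewhere.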

Any time I sit down with an engineer who "outsources" responsibility for
maintaining the robustness principle onto their customers, it makes me
want to puke.

James
