Re: FYI Netflix is down

AP NANOG Mon, 02 Jul 2012 12:33:33 -0700

I believe in my dictionary Chaos Gorilla translates into "Time To Go Home", with a rough definition of "Everything just crapped out - The world is ending"; but then again I may have that incorrect :-)

--


Thank you,

Robert Miller
http://www.armoredpackets.com

Twitter: @arch3angel

On 7/2/12 2:59 PM, Paul Graydon wrote:

On 07/02/2012 08:53 AM, Tony McCrory wrote:
On 2 July 2012 19:20, Cameron Byrne <cb.li...@gmail.com> wrote:
Make your chaos animal go after sites and regions instead of individual
VMs.

CB
From a previous post mortem:
http://techblog.netflix.com/2011_04_01_archive.html

"
Create More Failures
Currently, Netflix uses a service called "Chaos
Monkey<http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html>"
to simulate service failure. Basically, Chaos Monkey is a service that
kills other services. We run this service because we want engineering teams
to be used to a constant level of failure in the cloud. Services should
automatically recover without any manual intervention. We don't however,
simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures.
Internally we are having discussions about doing that and people are
already starting to call this service "Chaos Gorilla".
"

It would seem the Gorilla hasn't quite matured.

Tony
From conversations with Adrian Cockcroft this weekend it wasn't the result of Chaos Gorilla or Chaos Monkey failing to prepare them adequately. All their automated stuff worked perfectly, the infrastructure tried to self heal. The problem was that yet again Amazon's back-plane / control-plane was unable to cope with the requests. Netflix uses Amazon's ELB to balance the traffic and no back-plane meant they were unable to reconfigure it to route around the problem.
Paul
