The problem is that large-scale tests take a lot of time and planning. For them to be done right, you really need a dedicated DR team.
-Grant

On Mon, Jul 2, 2012 at 11:31 AM, AP NANOG <na...@armoredpackets.com> wrote:

> This is an excellent example of how tests "should" be run; unfortunately,
> far too many places don't do this...
>
> --
>
> Thank you,
>
> Robert Miller
> http://www.armoredpackets.com
>
> Twitter: @arch3angel
>
> On 7/2/12 12:09 PM, Leo Bicknell wrote:
>
>> In a message written on Mon, Jul 02, 2012 at 11:30:06AM -0400, Todd
>> Underwood wrote:
>>
>>> from the perspective of people watching B-rate movies: this was a
>>> failure to implement and test a reliable system for streaming those
>>> movies in the face of a power outage at one facility.
>>>
>> I want to emphasize _and test_.
>>
>> Work on an infrastructure which is redundant and designed to provide
>> "100% uptime" (which is impossible, but that's another story) means
>> that there should be confidence in a failure being automatically
>> worked around, detected, and reported.
>>
>> I used to work with a guy who had a simple test for these things,
>> and if I was a VP at Amazon, Netflix, or any other large company I
>> would do the same. About once a month he would walk out on the
>> floor of the data center and break something. Pull out an ethernet.
>> Unplug a server. Flip a breaker.
>>
>> Then he would wait, to see how long before a technician came to fix
>> it.
>>
>> If these activities were service impacting to customers, the engineering
>> or implementation was faulty, and remediation was performed. Assuming
>> they acted as designed and the customers saw no faults, the team was
>> graded on how quickly they detected and corrected the outage.
>>
>> I've seen too many companies whose "test" is planned months in advance,
>> and who exclude the parts they think aren't up to scratch from the test.
>> Then an event occurs, and they fail, and take down customers.
>>
>> TL;DR If you're not confident your operation could withstand someone
>> walking into your data center and randomly doing something, you are
>> NOT redundant.