> Actually, it was a very complex power outage. I'm going to assume that what
> happened this weekend was similar to the event that happened at the same
> facility approximately two weeks ago (it's immaterial - the details are
> probably different, but it illustrates the complexity of a data center
> failure)
>
> Utility Power Failed
> First Backup Generator Failed (shut down due to a faulty fan)
> Second Backup Generator Failed (breaker coordination problem resulting in
> faulty trip of a breaker)
>
> In this case, it was clearly a cascading failure, although limited in
> scope. The failure in this case also clearly involved people. There was one
> material failure (the fan), but the system should have been resilient enough
> to deal with it. The system should also have been resilient enough to deal
> with the breaker coordination issue (which should not have occurred), but was
> not. Data centers are not commodities. There is a way to engineer these
> facilities to be much more resilient. Not everyone's business model supports
> it.
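A quick way to see why two backup layers "should" have made a total outage rare, and why a breaker coordination problem undermines that: if the utility feed and the two generators fail independently, the outage probability is the product of the three, but a common-mode fault that can defeat both generators at once adds a term that dominates. The Python sketch below uses purely hypothetical probabilities for illustration, not data from this event.

    # Back-of-the-envelope model: utility feed backed by two generators.
    # All numbers are hypothetical, chosen for illustration only.

    p_utility = 1e-3   # assumed: chance utility power is lost in some interval
    p_gen = 1e-2       # assumed: chance one generator fails to carry the load

    # Independent layers: an outage needs utility AND both generators to fail.
    p_independent = p_utility * p_gen * p_gen
    print(f"independent failures:  {p_independent:.1e}")   # 1.0e-07

    # A common-mode fault (e.g. breaker mis-coordination that trips during
    # transfer) can defeat both generators at once. First-order approximation:
    p_common = 1e-2    # assumed: chance the shared fault fires on a transfer
    p_correlated = p_utility * (p_gen * p_gen + p_common)
    print(f"with common-mode term: {p_correlated:.1e}")    # ~1.0e-05, ~100x worse

The point the numbers make is the one in the quote: the single material failure (the fan) is covered by independent redundancy, while the coordination error couples the layers and eats most of the engineered margin.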
ok, i give in. at some level of granularity everything is a cascading failure
(since molecules collide and the world is an infinite chain of causation in
which human free will is merely a myth </Spinoza>). of course, this use of
'cascading' is vacuous and not useful anymore since it applies to nearly every
failure, but i'll go along with it.

from the perspective of a datacenter power engineer: this was a cascading
failure of a small number of components.

from the perspective of every datacenter customer: this was a power failure.

from the perspective of people watching B-rate movies: this was a failure to
implement and test a reliable system for streaming those movies in the face of
a power outage at one facility.

from the perspective of nanog mailing list readers: this was an interesting
opportunity to speculate about failures about which we have no data (as
usual!).

can we all agree on those facts? :-)

t