On Mon, Jul 2, 2012 at 12:43 PM, Greg D. Moore <moor...@greenms.com> wrote:
> At 03:08 PM 7/2/2012, George Herbert wrote:
>
> If folks have not read it, I would suggest reading Normal Accidents by
> Charles Perrow.
>
> The "it can't happen" is almost guaranteed to happen. ;-) And when it does,
> it'll often interact in ways we can't predict or sometimes even understand.

Seconded. There are also aerospace and nuclear and failure analysis
books which are good, but I often encourage people to start with that
one.

> As for pulling the plug to test stuff. I recall a demo at Netapps in the
> early 00's. They were talking about their fault tolerance and how great it
> was. So I walked up to their demo array and said, "So, it shouldn't be a
> problem if I pulled this drive right here?" Before I could, the salesperson
> or tech guy, can't remember, told me to stop. He didn't want to risk it.
>
> That right there said loads about their confidence in their own system.

I worked for a Sun clone vendor (Axil) for a while and took some of
our systems and storage to Comdex one year in the 90s. We had a RAID
unit (Mylex controller) we had just introduced. Beforehand, I made
REALLY REALLY SURE that the pull-the-disk and pull-the-redundant-power
tricks worked. And showed them to people with the "Please keep in mind
that this voids the warranty, but here we *rip* go...".

All of the other server vendors were giving me dirty looks for that
one. Apparently I sold a few systems that way.

You have to watch for connector wear-out and things like that, but ...

On all the clusters I've built, I've insisted on a burn-in-time
plug-pull test on all the major components. We caught things with
those from time to time. Especially with N+1, if it is really N+0 due
to a bug or flaw, you need to know that...

-- 
-george william herbert
george.herb...@gmail.com