On Tue, Jul 02, 2013 at 10:53:50AM -0400, Digimer wrote:
> On 07/02/2013 04:02 AM, Dejan Muhamedagic wrote:
> > On Mon, Jul 01, 2013 at 11:53:29AM -0400, Digimer wrote:
> >> On 07/01/2013 04:52 AM, Dejan Muhamedagic wrote:
> >>> Right. It is often missed that actually more than one failure is
> >>> required for that setup to fail. In case of dual PDU/PSU/UPS an
> >>> IPMI based fencing is sufficient.
> >>
> >> You are right, of course. Imagine though that the IPMI BMC's network
> >> port or cable could have silently failed some time before the node
> >> failed. Yes, this is two simultaneous failures so not an overall SPoF,
> >> but likely enough that it should be addressed.
> >>
> >> If you've already set up redundant power, then it strikes me as fairly
> >> easy to use your PDUs as a backup fence method.
> >>
> >> Now all this said, you'll note in the mailing lists and IRC that I don't
> >> tell people they should have two methods. If people set up just IPMI
> >> fencing, I am happy. It's a question of how careful do you want/need to
> >> be, after that. For me, one fence method is not enough.
> >
> > I suppose that you're supporting a few clusters. How often does
> > it happen that nodes get fenced? And why? And did you in those
> > cases need to use the backup fence device?
> >
> > Thanks,
> >
> > Dejan
>
> They occasionally get fenced, but it's very rare. Most were from an
> earlier configuration I no longer offer that was based on one switch
> (with redundant NICs in bond mode=1). The switch would hiccup and that
> would trigger fencing. Since I switched to dual switches, I've not had a
> network-triggered failure.
>
> The most common problem I see, that my cluster saved people from, is
> power problems. These have never required fencing, but rather simply
> having two monitored UPSes has allowed us to detect pending
> catastrophic power failures (a transformer blew up three days after we
> started seeing alerts, a faulty regulator in a customer's neighborhood,
> etc).

Right, I'd also guess that power failures are the most common in
the hardware category.

> We've also saved a customer's entire (small) DC when they lost AC and
> their own alerts failed (we saw a sudden rise in inlet temp and alerted
> the client). One node at the top of the rack (out of four dual-node
> clusters) went into thermal shutdown and got fenced before we could shed
> enough load. They didn't lose any of their non-clustered servers though.
>
> So to your question: have we ever needed the backup fencing in
> production? Nope, but I see it as just a matter of time. One user error,
> one bad UPS/battery pack, one tripped breaker and it will save us. When
> we demo our clusters to prospective customers, the most dramatic test we
> do is shut down the primary UPS. This takes out one of the switches, one
> of the dashboard appliances and forces the nodes to run on half their
> power. If this happened in production, then dual-PDUs would certainly
> save us.
>
> Not my personal experience, but a sysadmin friend of mine had a case
> where a server's 12vDC wire was rubbing against a sharp piece of the
> chassis. Eventually it cut through the insulation and shorted out,
> taking the node's power off despite having redundant PSUs. Had this
> happened to our cluster, we'd have been saved by the backup fence device
> because the IPMI would have been lost.

There are also some lights-out devices with battery backup
providing power for enough time for fencing to succeed.

> I've got ten or so customers around North America and I've only been
> doing this for four years or so. That I have not *yet* been saved by
> backup fencing in no way means it is not needed. :)

I'd really be interested in numbers which we don't have, that is,
how much extra availability a backup fencing device provides in a
fully redundant power supply setup.
Of course, taking every possible precaution is commendable, but
in this case it seems to introduce a level of complexity which
is hard to grasp for most people (even those running clusters).

Thanks,

Dejan

> -- 
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
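
The open question in this thread (how much a backup fence device adds
once power is already redundant) could be framed roughly as below. This
is only a sketch: the function names are made up for illustration, the
figures in the example run are placeholders and not measurements, and it
assumes the PDU fence path fails independently of the IPMI path, which
may not hold on a real site.

```python
def ipmi_only_failure(p_shared_cause, p_bmc_unreachable):
    """Chance that IPMI fencing fails, given that the node has already failed.

    p_shared_cause: chance that whatever killed the node also took out the
        BMC (for example a short that removes all power inside the chassis).
    p_bmc_unreachable: chance that the BMC was already silently unreachable
        (dead port or cable) when an otherwise unrelated failure hit.
    Both parameters are hypothetical inputs, not values from the thread.
    """
    for p in (p_shared_cause, p_bmc_unreachable):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return p_shared_cause + (1.0 - p_shared_cause) * p_bmc_unreachable


def with_backup_failure(p_shared_cause, p_bmc_unreachable, p_pdu_fence_fails):
    """Chance that IPMI fencing fails and the PDU fallback fails as well.

    Assumes the PDU path fails independently of the IPMI path.
    """
    if not 0.0 <= p_pdu_fence_fails <= 1.0:
        raise ValueError("probabilities must be between 0 and 1")
    return ipmi_only_failure(p_shared_cause, p_bmc_unreachable) * p_pdu_fence_fails


if __name__ == "__main__":
    # Placeholder figures for illustration only.
    shared, bmc, pdu = 0.05, 0.01, 0.02
    print("IPMI only:      ", ipmi_only_failure(shared, bmc))
    print("IPMI, then PDU: ", with_backup_failure(shared, bmc, pdu))
```

With real per-site estimates in place of the placeholders, the ratio of
the two results gives a rough idea of whether the second fence level is
worth its extra configuration complexity.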