Re: [Pacemaker] Fixed! - Re: Problem with dual-PDU fencing node with redundant PSUs

Digimer Thu, 04 Jul 2013 08:16:53 -0700

On 04/07/13 10:06, Dejan Muhamedagic wrote:

On Tue, Jul 02, 2013 at 10:53:50AM -0400, Digimer wrote:

On 07/02/2013 04:02 AM, Dejan Muhamedagic wrote:

On Mon, Jul 01, 2013 at 11:53:29AM -0400, Digimer wrote:

On 07/01/2013 04:52 AM, Dejan Muhamedagic wrote:

Right. It is often missed that actually more than one failure is
required for that setup to fail. In case of dual PDU/PSU/UPS an
IPMI based fencing is sufficient.


You are right, of course. Imagine though that the IPMI BMC's network
port or cable could have silently failed some time before the node
failed. Yes, this is two simultaneous failures so not an overall SPoF,
but likely enough that it should be addressed.

If you've already set up redundant power, then it strikes me as fairly
easy to use your PDUs as a backup fence method.

Now all this said, you'll note in the mailing lists and IRC that I don't
tell people they should have two methods. If people set up just IPMI
fencing, I am happy. It's a question of how careful do you want/need to
be, after that. For me, one fence method is not enough.
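
To make the primary/backup idea concrete, here is a minimal sketch of ordered fence methods. It is not Pacemaker's code or any fence agent's API; every name in it (`fence_node`, `ipmi_unreachable`, `pdu_a_off`, `pdu_b_off`) is hypothetical. It assumes each method is tried in order and that a method counts as successful only when every device in it succeeds, which is why a node with redundant PSUs needs both PDUs to cut power.

```python
from typing import Callable, Sequence

# A fence device is modelled as a callable that returns True when the
# target node was confirmed powered off. All names here are hypothetical.
FenceDevice = Callable[[str], bool]


def fence_node(node: str, levels: Sequence[Sequence[FenceDevice]]) -> int:
    """Try each fence level in order; return the 1-based level that worked.

    A level succeeds only if it is non-empty and every device in it
    succeeds, so both PDUs must cut power on a node with redundant PSUs.
    """
    for index, devices in enumerate(levels, start=1):
        if devices and all(device(node) for device in devices):
            return index
    raise RuntimeError(f"all fence levels failed for {node}")


# Stand-in devices for illustration only; real agents would talk to hardware.
def ipmi_unreachable(node: str) -> bool:
    return False  # e.g. the BMC's network port or cable failed silently


def pdu_a_off(node: str) -> bool:
    return True


def pdu_b_off(node: str) -> bool:
    return True


# Level 1 is IPMI alone; level 2 is both PDUs together.
level = fence_node("node1", [[ipmi_unreachable], [pdu_a_off, pdu_b_off]])
print(level)
```

With the IPMI stand-in failing, the call falls through to the second level, where both PDU stand-ins must report success.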


I suppose that you're supporting a few clusters. How often does
it happen that nodes get fenced? And why? And did you in those
cases need to use the backup fence device?

Thanks,

Dejan


They occasionally get fenced, but it's very rare. Most were from an
earlier configuration I no longer offer that were based on one switch
(with redundant NICs in bond mode=1). The switch would hiccup and that
would trigger fencing. Since I switched to dual switches, I've not had a
network-triggered failure.

The most common problem I see, that my cluster saved people from, is
power problems. These have never required fencing, but rather simply
having two monitored UPSes has allowed us to detect pending
catastrophic power failures (a transformer blew up three days after we
started seeing alerts, a faulty regulator in a customer's neighborhood,
etc).


Right, I'd also guess that power failures are the most common in
the hardware category.

We've also saved a customer's entire (small) DC when they lost AC and
their own alerts failed (we saw a sudden rise in inlet temp and alerted
the client). One node at the top of the rack (out of four dual-node
clusters) went into thermal shutdown and got fenced before we could shed
enough load. They didn't lose any of their non-clustered servers though.

So to your question; have we ever needed the backup fencing in
production? Nope, but I see it as just a matter of time. One user error,
one bad UPS/battery pack, one tripped breaker and it will save us. When
we demo our clusters to prospective customers, the most dramatic test we
do is shut down the primary UPS. This takes out one of the switches, one
of the dashboard appliances and forces the nodes to run on half their
power. If this happened in production, then dual-PDUs would certainly
save us.

Not my personal experience, but a sysadmin friend of mine had a case
where a server's 12vDC wire was rubbing against a sharp piece of the
chassis. Eventually it cut through the insulation and shorted out,
taking the node's power off despite having redundant PSUs. Had this
happened to our cluster, we'd have been saved by the backup fence device
because the IPMI would have been lost.


There are also some lights-out devices with battery backup
providing power for enough time for fencing to succeed.

I've got ten or so customers around North America and I've only been
doing this for four years or so. That I have not *yet* been saved by
backup fencing in no way means it is not needed. :)


I'd really be interested in numbers which we don't have, that is
how much extra availability in a fully redundant power supply
setup a backup fencing device provides.

Of course, taking every possible precaution is commendable, but
in this case it seems like it introduces a level of complexity
which is hard to grasp for most people (even those running
clusters).

Thanks,

Dejan

Much like security, performance and other concerns, it's up to each user
to find their balance point. For me and my customers, redundant
everything is required. For many others, perhaps it isn't.

As for the numbers; I would *love* to have those as well. Shy of some
self-reporting system where HA admins fill out forms after incidents
though, I don't see how we could ever gather that data. Even then, it
will never be mandatory, obviously, so the results would be skewed by
the personality type of people willing and able to take the time to
submit those anonymous reports.


cheers!

--
Digimer
Papers and Projects: https://alteeve.ca/w/

What if the cure for cancer is trapped in the mind of a person without
access to education?


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
