On 12/21/2012 07:47 PM, Andrew Martin wrote:
> Andreas,
>
> Thanks for the help. Please see my replies inline below.
>
> ----- Original Message -----
>> From: "Andreas Kurz" <andr...@hastexo.com>
>> To: pacemaker@oss.clusterlabs.org
>> Sent: Friday, December 21, 2012 10:11:08 AM
>> Subject: Re: [Pacemaker] Best way to recover from failed STONITH?
>>
>> On 12/21/2012 04:18 PM, Andrew Martin wrote:
>>> Hello,
>>>
>>> Yesterday a power failure took out one of the nodes and its STONITH
>>> device (they share an upstream power source) in a 3-node
>>> active/passive cluster (Corosync 2.1.0, Pacemaker 1.1.8). After
>>> logging into the cluster, I saw that the STONITH operation had
>>> given up in failure and that none of the resources were running on
>>> the other nodes:
>>>
>>> Dec 20 17:59:14 [18909] quorumnode crmd: notice: too_many_st_failures:
>>> Too many failures to fence node0 (11), giving up
>>>
>>> I brought the failed node back online and it rejoined the cluster,
>>> but no more STONITH attempts were made and the resources remained
>>> stopped. Eventually I set stonith-enabled="false", ran killall on
>>> all pacemaker-related processes on the other (remaining) nodes,
>>> then restarted pacemaker, and the resources successfully migrated
>>> to one of the other nodes. This seems like a rather invasive
>>> technique. My questions about this type of situation are:
>>> - is there a better way to tell the cluster "I have manually
>>> confirmed this node is dead/safe"? I see there is the meatclient
>>> command, but can that only be used with the meatware STONITH
>>> plugin?
>>
>> crm node cleanup quorumnode
>
> I'm using the latest version of crmsh (1.2.1) but it doesn't seem to
> support this command:

ah ... sorry, true ... it's the "clearstate" command ... but it does a
"cleanup" ;-)

> root@node0:~# crm --version
> 1.2.1 (Build unknown)
> root@node0:~# crm node
> crm(live)node# help
>
> Node management and status commands.
>
> Available commands:
>
>         status        show nodes' status as XML
>         show          show node
>         standby       put node into standby
>         online        set node online
>         fence         fence node
>         clearstate    Clear node state
>         delete        delete node
>         attribute     manage attributes
>         utilization   manage utilization attributes
>         status-attr   manage status attributes
>         help          show help (help topics for list of topics)
>         end           go back one level
>         quit          exit the program
>
> Also, do I run cleanup on just the node that failed, or all of them?

You need to specify a node with this command, and you only need to (and
should only) do this for the failed node.
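So once you have manually verified that the failed node really is powered
off, clearing its state should be all that is needed, e.g. (the node name
is just an example, use the one that could not be fenced):

    crm node clearstate node0

If you prefer the low-level tools, stonith_admin should be able to do the
same ... untested from here, so double-check the syntax, and be aware that
this tells the cluster the node can be treated as safely fenced:

    stonith_admin --confirm node0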
>>> - in general, is there a way to force the cluster to start
>>> resources, if you just need to get them back online and as a
>>> human have confirmed that things are okay? Something like crm
>>> resource start rsc --force?
>>
>> ... see above ;-)
>
> On a related note, is there a way to get better information
> about why the cluster is in its current state? For example, in this
> situation it would be nice to be able to run a command and have the
> cluster print "resources stopped until node XXX can be fenced" to
> be able to quickly assess the problem with the cluster.

yeah ... not all cluster command outputs and logs are user-friendly ;-) ...
sorry, I'm not aware of a direct way to get better information, maybe
someone else knows one?
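The closest thing I can think of is looking at the fail counts and failed
actions, e.g. something like:

    crm_mon -1rf

-1 prints the status once and exits, -r also lists inactive resources and
-f adds the per-resource fail counts ... together with the "Failed actions"
section at the bottom that usually points in the right direction, even if
it will not literally say "waiting for fencing".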
>>> - how can I completely clear out saved data for the cluster and
>>> start over from scratch (last-resort option)? Stopping pacemaker
>>> and removing everything from /var/lib/pacemaker/cib and
>>> /var/lib/pacemaker/pengine cleans the CIB, but the nodes end up
>>> sitting in the "pending" state for a very long time (30 minutes
>>> or more). Am I missing another directory that needs to be
>>> cleared?
>>
>> you started with a completely empty CIB and the two (or three?) nodes
>> needed 30 minutes to form a cluster?
>
> Yes, in fact I cleared out both /var/lib/pacemaker/cib and
> /var/lib/pacemaker/pengine several times, and most of the time after
> starting pacemaker again one node would become "online" pretty quickly
> (less than 5 minutes), but the other two would remain "pending" for
> quite some time. I left it going overnight and this morning all of the
> nodes

that does not sound correct ... do you have any logs from that time?
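By the way, to get back to an empty configuration you should not have to
stop pacemaker and delete files by hand at all ... with the cluster still
running, something like

    cibadmin --erase --force

should wipe the configuration from the live CIB on all nodes (careful,
there is no undo). Nodes stuck in "pending" sounds more like a
membership/corosync problem than leftover CIB data, which is why the logs
would be interesting.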
>>>
>>> I am going to look into making the power source for the STONITH
>>> device independent of the power source for the node itself,
>>> however even with that setup there's still a chance that something
>>> could take out both power sources at the same time, in which case
>>> manual intervention and confirmation that the node is dead would
>>> be required.
>>
>> Pacemaker 1.1.8 supports (again) stonith topologies ... so more than
>> one fencing device, and they can be "logically" combined.
>
> Where can I find documentation on STONITH topologies and configuring
> more than one fencing device for a single node? I don't see it mentioned
> in the Cluster Labs documentation (Clusters from Scratch or Pacemaker
> Explained).

hmm ... good question ... aside from the source code, which includes an
example as a comment, I'm not aware of any real documentation ....
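From memory, the raw CIB syntax looks roughly like this ... the ids and
device names here are made up, so adapt them to your own stonith
resources; level 2 is only tried after level 1 has failed:

    <fencing-topology>
      <fencing-level id="fl-node0-1" target="node0" index="1"
                     devices="stonith-ipmi-node0"/>
      <fencing-level id="fl-node0-2" target="node0" index="2"
                     devices="stonith-pdu-node0"/>
    </fencing-topology>

Newer crm shell versions also have a fencing_topology directive on the
configure level, something like:

    fencing_topology node0: stonith-ipmi-node0 stonith-pdu-node0

but I would double-check whether your 1.2.1 already supports it.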
Best regards,
Andreas

>
> Thanks,
>
> Andrew
>

--
Need help with Pacemaker?
http://www.hastexo.com/now