On 12/21/2012 04:18 PM, Andrew Martin wrote:
> Hello,
>
> Yesterday a power failure took out one of the nodes and its STONITH
> device (they share an upstream power source) in a 3-node active/passive
> cluster (Corosync 2.1.0, Pacemaker 1.1.8). After logging into the
> cluster, I saw that the STONITH operation had given up in failure and
> that none of the resources were running on the other nodes:
>
> Dec 20 17:59:14 [18909] quorumnode crmd: notice:
> too_many_st_failures: Too many failures to fence node0 (11), giving up
>
> I brought the failed node back online and it rejoined the cluster, but
> no more STONITH attempts were made and the resources remained stopped.
> Eventually I set stonith-enabled="false", ran killall on all
> pacemaker-related processes on the other (remaining) nodes, then
> restarted Pacemaker, and the resources successfully migrated to one of
> the other nodes. This seems like a rather invasive technique. My
> questions about this type of situation are:
>
> - is there a better way to tell the cluster "I have manually confirmed
> this node is dead/safe"? I see there is the meatclient command, but can
> that only be used with the meatware STONITH plugin?

crm node cleanup quorumnode
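If you want to tell the cluster explicitly that you have verified the
node is down, there are also lower-level commands for that; a sketch
(option names may differ slightly between versions, so check the man
pages for your exact release):

  # Tell the fencing daemon that node0 is already safely down,
  # recording a manual fencing confirmation:
  stonith_admin --confirm node0

  # crmsh equivalent; asks you to confirm the node really is down:
  crm node clearstate node0

Only do this once you are certain the node is really powered off --
confirming a node that is still running defeats the point of fencing.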
> - in general, is there a way to force the cluster to start resources,
> if you just need to get them back online and as a human have confirmed
> that things are okay? Something like crm resource start rsc --force?

... see above ;-)
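Once the failure history is cleaned up, the cluster will try to place the
resources again on its own; if you want to nudge a specific resource, a
minimal crmsh sketch (my_rsc is a placeholder name):

  # Forget recorded failures so the policy engine reconsiders placement:
  crm resource cleanup my_rsc

  # Explicitly request a start (sets the resource's target-role):
  crm resource start my_rsc

Note that none of this overrides a pending fencing action: as long as the
cluster still needs to fence a node, it will not recover resources first.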
> - how can I completely clear out saved data for the cluster and start
> over from scratch (last-resort option)? Stopping pacemaker and removing
> everything from /var/lib/pacemaker/cib and /var/lib/pacemaker/pengine
> cleans the CIB, but the nodes end up sitting in the "pending" state for
> a very long time (30 minutes or more). Am I missing another directory
> that needs to be cleared?

You started with a completely empty CIB, and the two (or three?) nodes
needed 30 minutes to form a cluster?
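For the record, a "start from scratch" usually looks roughly like this
(a sketch; the paths below are the Pacemaker 1.1.8 defaults and the
service commands depend on your init system), executed on every node:

  # Stop the cluster stack first:
  /etc/init.d/pacemaker stop    # or service/systemctl, as appropriate
  /etc/init.d/corosync stop

  # Wipe the CIB and the policy engine's input history:
  rm -f /var/lib/pacemaker/cib/*
  rm -f /var/lib/pacemaker/pengine/*

  # Then start corosync and pacemaker again on all nodes.

Nodes joining with an empty CIB normally form a cluster within seconds,
so 30 minutes in "pending" points at a membership or quorum problem
rather than at leftover state on disk.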
> I am going to look into making the power source for the STONITH device
> independent of the power source for the node itself, however even with
> that setup there's still a chance that something could take out both
> power sources at the same time, in which case manual intervention and
> confirmation that the node is dead would be required.
>
> Thanks,
>
> Andrew

Pacemaker 1.1.8 supports (again) STONITH topologies, i.e. more than one
fencing device per node, combined "logically" into levels; see the sketch
below.
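A minimal crmsh sketch of such a topology (the agents and all parameters
here are placeholders, not a tested configuration):

  # Two fencing devices for node0: IPMI is tried first, and the
  # switched rack PDU is the level-2 fallback if IPMI fails.
  primitive st-ipmi stonith:external/ipmi \
      params hostname=node0 ipaddr=10.0.0.10 userid=admin passwd=secret
  primitive st-pdu stonith:external/rackpdu \
      params hostlist=node0 pduip=10.0.0.20
  fencing_topology node0: st-ipmi st-pdu

With a second device on independent power, losing the primary STONITH
device no longer leaves the cluster unable to fence the node.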
Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now