On Sat, Dec 22, 2012 at 3:11 AM, Andreas Kurz <andr...@hastexo.com> wrote:
> On 12/21/2012 04:18 PM, Andrew Martin wrote:
>> Hello,
>>
>> Yesterday a power failure took out one of the nodes and its STONITH
>> device (they share an upstream power source) in a 3-node active/passive
>> cluster (Corosync 2.1.0, Pacemaker 1.1.8). After logging into the
>> cluster, I saw that the STONITH operation had given up in failure and
>> that none of the resources were running on the other nodes:
>>
>> Dec 20 17:59:14 [18909] quorumnode crmd: notice:
>> too_many_st_failures: Too many failures to fence node0 (11), giving up
>>
>> I brought the failed node back online and it rejoined the cluster, but
>> no more STONITH attempts were made and the resources remained stopped.
>> Eventually I set stonith-enabled="false", ran killall on all
>> pacemaker-related processes on the other (remaining) nodes, and then
>> restarted pacemaker; the resources then successfully migrated to one of
>> the other nodes. This seems like a rather invasive technique. My
>> questions about this type of situation are:
>>
>> - is there a better way to tell the cluster "I have manually confirmed
>> this node is dead/safe"? I see there is the meatclient command, but can
>> that only be used with the meatware STONITH plugin?
>
> crm node cleanup quorumnode

That only does the resources, though. For "I have manually confirmed this
node is dead/safe" you probably want stonith_admin and:

  -C, --confirm=value    Confirm the named host is now safely down
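For example, once you have verified out-of-band that node0 really is
powered off (a sketch; check stonith_admin --help on your build for the
exact option spelling):

  # Tell the cluster the fencing of node0 can be treated as complete.
  # Only run this after confirming the node is actually down, otherwise
  # you risk data corruption from a node that is still alive.
  stonith_admin --confirm=node0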
>> - in general, is there a way to force the cluster to start resources,
>> if you just need to get them back online and as a human have confirmed
>> that things are okay? Something like crm resource start rsc --force?
>
> ... see above ;-)
>
>> - how can I completely clear out saved data for the cluster and start
>> over from scratch (last-resort option)? Stopping pacemaker and removing
>> everything from /var/lib/pacemaker/cib and /var/lib/pacemaker/pengine
>> cleans the CIB, but the nodes end up sitting in the "pending" state for
>> a very long time (30 minutes or more). Am I missing another directory
>> that needs to be cleared?
>
> you started with a completely empty cib and the two (or three?) nodes
> needed 30min to form a cluster?
>
>> I am going to look into making the power source for the STONITH device
>> independent of the power source for the node itself, however even with
>> that setup there's still a chance that something could take out both
>> power sources at the same time, in which case manual intervention and
>> confirmation that the node is dead would be required.
>
> Pacemaker 1.1.8 supports (again) stonith topologies ... so more than one
> fencing device, and they can be "logically" combined.
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>> Thanks,
>>
>> Andrew
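Coming back to the "start over from scratch" question, for the archives: a
minimal sketch of a full wipe, assuming the /var/lib/pacemaker layout
mentioned above and sysvinit-style service scripts (adjust both to your
distribution), and assuming you really do intend to lose the entire
configuration:

  # run on every node, with the cluster stack stopped
  service pacemaker stop
  service corosync stop
  rm -f /var/lib/pacemaker/cib/*      # the CIB plus its signature/backup files
  rm -f /var/lib/pacemaker/pengine/*  # saved policy engine inputs
  service corosync start
  service pacemaker start

If the nodes still sit in "pending" for 30 minutes after a wipe like that,
I would look at the membership layer (corosync rings, firewall) rather
than hunt for more on-disk state to delete.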
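And on the fencing topology point, for anyone finding this thread later: a
sketch of what the raw CIB XML can look like, where the device names
fence_ipmi_node0 and fence_pdu_node0 are placeholders for your own
already-configured stonith resources:

  <fencing-topology>
    <!-- index 1 is tried first; if it fails, index 2 is tried next -->
    <fencing-level id="fl-node0-1" target="node0" index="1" devices="fence_ipmi_node0"/>
    <fencing-level id="fl-node0-2" target="node0" index="2" devices="fence_pdu_node0"/>
  </fencing-topology>

Load it with something like "cibadmin --create -o configuration -x
fencing.xml"; the crm shell and pcs have their own shorthands for the same
thing. With the second device on an independent power feed, a setup like
this could have covered the failure that started this thread.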