On 12/21/2012 04:18 PM, Andrew Martin wrote:
> Hello,
>
> Yesterday a power failure took out one of the nodes and its STONITH
> device (they share an upstream power source) in a 3-node active/passive
> cluster (Corosync 2.1.0, Pacemaker 1.1.8). After logging into the
> cluster, I saw that the STONITH operation had given up in failure and
> that none of the resources were running on the other nodes:
>
> Dec 20 17:59:14 [18909] quorumnode crmd: notice:
> too_many_st_failures: Too many failures to fence node0 (11), giving up
>
> I brought the failed node back online and it rejoined the cluster, but
> no more STONITH attempts were made and the resources remained stopped.
> Eventually I set stonith-enabled="false", ran killall on all
> pacemaker-related processes on the other (remaining) nodes, then
> restarted Pacemaker, and the resources successfully migrated to one of
> the other nodes. This seems like a rather invasive technique. My
> questions about this type of situation are:
>
> - is there a better way to tell the cluster "I have manually confirmed
> this node is dead/safe"? I see there is the meatclient command, but can
> that only be used with the meatware STONITH plugin?

crm node cleanup quorumnode
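If you want to tell the cluster explicitly that you have verified the
node is down, there are also lower-level commands for that; a sketch
(option names may differ slightly between versions, so check the man
pages for your exact release):

  # Tell the fencing daemon that node0 is already safely down,
  # recording a manual fencing confirmation:
  stonith_admin --confirm node0

  # crmsh equivalent; asks you to confirm the node really is down:
  crm node clearstate node0

Only do this once you are certain the node is really powered off --
confirming a node that is still running defeats the point of fencing.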
> - in general, is there a way to force the cluster to start resources,
> if you just need to get them back online and as a human have confirmed
> that things are okay? Something like crm resource start rsc --force?

... see above ;-)
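Once the failure history is cleaned up, the cluster will try to place the
resources again on its own; if you want to nudge a specific resource, a
minimal crmsh sketch (my_rsc is a placeholder name):

  # Forget recorded failures so the policy engine reconsiders placement:
  crm resource cleanup my_rsc

  # Explicitly request a start (sets the resource's target-role):
  crm resource start my_rsc

Note that none of this overrides a pending fencing action: as long as the
cluster still needs to fence a node, it will not recover resources first.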
> - how can I completely clear out saved data for the cluster and start
> over from scratch (last-resort option)? Stopping pacemaker and removing
> everything from /var/lib/pacemaker/cib and /var/lib/pacemaker/pengine
> cleans the CIB, but the nodes end up sitting in the "pending" state for
> a very long time (30 minutes or more). Am I missing another directory
> that needs to be cleared?

You started with a completely empty CIB, and the two (or three?) nodes
needed 30 minutes to form a cluster?
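For the record, a "start from scratch" usually looks roughly like this
(a sketch; the paths below are the Pacemaker 1.1.8 defaults and the
service commands depend on your init system), executed on every node:

  # Stop the cluster stack first:
  /etc/init.d/pacemaker stop    # or service/systemctl, as appropriate
  /etc/init.d/corosync stop

  # Wipe the CIB and the policy engine's input history:
  rm -f /var/lib/pacemaker/cib/*
  rm -f /var/lib/pacemaker/pengine/*

  # Then start corosync and pacemaker again on all nodes.

Nodes joining with an empty CIB normally form a cluster within seconds,
so 30 minutes in "pending" points at a membership or quorum problem
rather than at leftover state on disk.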
> I am going to look into making the power source for the STONITH device
> independent of the power source for the node itself, however even with
> that setup there's still a chance that something could take out both
> power sources at the same time, in which case manual intervention and
> confirmation that the node is dead would be required.
>
> Thanks,
>
> Andrew

Pacemaker 1.1.8 supports (again) STONITH topologies, i.e. more than one
fencing device per node, combined "logically" into levels; see the sketch
below.
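A minimal crmsh sketch of such a topology (the agents and all parameters
here are placeholders, not a tested configuration):

  # Two fencing devices for node0: IPMI is tried first, and the
  # switched rack PDU is the level-2 fallback if IPMI fails.
  primitive st-ipmi stonith:external/ipmi \
      params hostname=node0 ipaddr=10.0.0.10 userid=admin passwd=secret
  primitive st-pdu stonith:external/rackpdu \
      params hostlist=node0 pduip=10.0.0.20
  fencing_topology node0: st-ipmi st-pdu

With a second device on independent power, losing the primary STONITH
device no longer leaves the cluster unable to fence the node.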
Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now