On Sat, Dec 22, 2012 at 3:11 AM, Andreas Kurz <andr...@hastexo.com> wrote:
> On 12/21/2012 04:18 PM, Andrew Martin wrote:
>> Hello,
>>
>> Yesterday a power failure took out one of the nodes and its STONITH
>> device (they share an upstream power source) in a 3-node active/passive
>> cluster (Corosync 2.1.0, Pacemaker 1.1.8). After logging into the
>> cluster, I saw that the STONITH operation had given up in failure and
>> that none of the resources were running on the other nodes:
>>
>> Dec 20 17:59:14 [18909] quorumnode crmd: notice:
>> too_many_st_failures: Too many failures to fence node0 (11), giving up
>>
>> I brought the failed node back online and it rejoined the cluster, but
>> no more STONITH attempts were made and the resources remained stopped.
>> Eventually I set stonith-enabled="false", ran killall on all
>> pacemaker-related processes on the other (remaining) nodes, and then
>> restarted pacemaker; the resources then successfully migrated to one of
>> the other nodes. This seems like a rather invasive technique. My
>> questions about this type of situation are:
>>
>> - is there a better way to tell the cluster "I have manually confirmed
>> this node is dead/safe"? I see there is the meatclient command, but can
>> that only be used with the meatware STONITH plugin?
>
> crm node cleanup quorumnode

That only does the resources, though. For "I have manually confirmed this
node is dead/safe" you probably want stonith_admin and:

  -C, --confirm=value    Confirm the named host is now safely down
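For example, once you have verified out-of-band that node0 really is
powered off (a sketch; check stonith_admin --help on your build for the
exact option spelling):

  # Tell the cluster the fencing of node0 can be treated as complete.
  # Only run this after confirming the node is actually down, otherwise
  # you risk data corruption from a node that is still alive.
  stonith_admin --confirm=node0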
>> - in general, is there a way to force the cluster to start resources,
>> if you just need to get them back online and as a human have confirmed
>> that things are okay? Something like crm resource start rsc --force?
>
> ... see above ;-)
>
>> - how can I completely clear out saved data for the cluster and start
>> over from scratch (last-resort option)? Stopping pacemaker and removing
>> everything from /var/lib/pacemaker/cib and /var/lib/pacemaker/pengine
>> cleans the CIB, but the nodes end up sitting in the "pending" state for
>> a very long time (30 minutes or more). Am I missing another directory
>> that needs to be cleared?
>
> you started with a completely empty cib and the two (or three?) nodes
> needed 30min to form a cluster?
>
>> I am going to look into making the power source for the STONITH device
>> independent of the power source for the node itself, however even with
>> that setup there's still a chance that something could take out both
>> power sources at the same time, in which case manual intervention and
>> confirmation that the node is dead would be required.
>
> Pacemaker 1.1.8 supports (again) stonith topologies ... so more than one
> fencing device, and they can be "logically" combined.
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>> Thanks,
>>
>> Andrew
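Coming back to the "start over from scratch" question, for the archives: a
minimal sketch of a full wipe, assuming the /var/lib/pacemaker layout
mentioned above and sysvinit-style service scripts (adjust both to your
distribution), and assuming you really do intend to lose the entire
configuration:

  # run on every node, with the cluster stack stopped
  service pacemaker stop
  service corosync stop
  rm -f /var/lib/pacemaker/cib/*      # the CIB plus its signature/backup files
  rm -f /var/lib/pacemaker/pengine/*  # saved policy engine inputs
  service corosync start
  service pacemaker start

If the nodes still sit in "pending" for 30 minutes after a wipe like that,
I would look at the membership layer (corosync rings, firewall) rather
than hunt for more on-disk state to delete.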
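And on the fencing topology point, for anyone finding this thread later: a
sketch of what the raw CIB XML can look like, where the device names
fence_ipmi_node0 and fence_pdu_node0 are placeholders for your own
already-configured stonith resources:

  <fencing-topology>
    <!-- index 1 is tried first; if it fails, index 2 is tried next -->
    <fencing-level id="fl-node0-1" target="node0" index="1" devices="fence_ipmi_node0"/>
    <fencing-level id="fl-node0-2" target="node0" index="2" devices="fence_pdu_node0"/>
  </fencing-topology>

Load it with something like "cibadmin --create -o configuration -x
fencing.xml"; the crm shell and pcs have their own shorthands for the same
thing. With the second device on an independent power feed, a setup like
this could have covered the failure that started this thread.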