Andreas,

----- Original Message -----
> From: "Andreas Kurz" <andr...@hastexo.com>
> To: pacemaker@oss.clusterlabs.org
> Sent: Friday, December 21, 2012 6:22:57 PM
> Subject: Re: [Pacemaker] Best way to recover from failed STONITH?
>
> On 12/21/2012 07:47 PM, Andrew Martin wrote:
> > Andreas,
> >
> > Thanks for the help. Please see my replies inline below.
> >
> > ----- Original Message -----
> >> From: "Andreas Kurz" <andr...@hastexo.com>
> >> To: pacemaker@oss.clusterlabs.org
> >> Sent: Friday, December 21, 2012 10:11:08 AM
> >> Subject: Re: [Pacemaker] Best way to recover from failed STONITH?
> >>
> >> On 12/21/2012 04:18 PM, Andrew Martin wrote:
> >>> Hello,
> >>>
> >>> Yesterday a power failure took out one of the nodes and its STONITH
> >>> device (they share an upstream power source) in a 3-node
> >>> active/passive cluster (Corosync 2.1.0, Pacemaker 1.1.8). After
> >>> logging into the cluster, I saw that the STONITH operation had
> >>> given up in failure and that none of the resources were running on
> >>> the other nodes:
> >>>
> >>> Dec 20 17:59:14 [18909] quorumnode crmd: notice:
> >>> too_many_st_failures: Too many failures to fence node0 (11),
> >>> giving up
> >>>
> >>> I brought the failed node back online and it rejoined the cluster,
> >>> but no more STONITH attempts were made and the resources remained
> >>> stopped. Eventually I set stonith-enabled="false", ran killall on
> >>> all pacemaker-related processes on the other (remaining) nodes,
> >>> then restarted pacemaker, and the resources successfully migrated
> >>> to one of the other nodes. This seems like a rather invasive
> >>> technique. My questions about this type of situation are:
> >>>
> >>> - Is there a better way to tell the cluster "I have manually
> >>> confirmed this node is dead/safe"? I see there is the meatclient
> >>> command, but can that only be used with the meatware STONITH
> >>> plugin?
> >>
> >> crm node cleanup quorumnode
> >
> > I'm using the latest version of crmsh (1.2.1), but it doesn't seem
> > to support this command:
>
> Ah ... sorry, true ... it's the "clearstate" command ... but it does a
> "cleanup" ;-)
>
> > root@node0:~# crm --version
> > 1.2.1 (Build unknown)
> > root@node0:~# crm node
> > crm(live)node# help
> >
> > Node management and status commands.
> >
> > Available commands:
> >
> >         status          show nodes' status as XML
> >         show            show node
> >         standby         put node into standby
> >         online          set node online
> >         fence           fence node
> >         clearstate      clear node state
> >         delete          delete node
> >         attribute       manage attributes
> >         utilization     manage utilization attributes
> >         status-attr     manage status attributes
> >         help            show help (help topics for list of topics)
> >         end             go back one level
> >         quit            exit the program
> >
> > Also, do I run cleanup on just the node that failed, or all of them?
>
> You need to specify a node with this command, and you only need/should
> do this for the failed node.
>
> >>> - In general, is there a way to force the cluster to start
> >>> resources, if you just need to get them back online and as a
> >>> human have confirmed that things are okay? Something like "crm
> >>> resource start rsc --force"?
> >>
> >> ... see above ;-)
> >
> > On a related note, is there a way to get better information
> > about why the cluster is in its current state?
> > For example, in this situation it would be nice to be able to run a
> > command and have the cluster print "resources stopped until node XXX
> > can be fenced", to be able to quickly assess the problem with the
> > cluster.
>
> Yeah ... not all cluster command outputs and logs are user-friendly ;-)
> ... sorry, I'm not aware of a direct way to get better information;
> maybe someone else is?
>
> >>> - How can I completely clear out saved data for the cluster and
> >>> start over from scratch (last-resort option)? Stopping pacemaker
> >>> and removing everything from /var/lib/pacemaker/cib and
> >>> /var/lib/pacemaker/pengine cleans the CIB, but the nodes end up
> >>> sitting in the "pending" state for a very long time (30 minutes
> >>> or more). Am I missing another directory that needs to be cleared?
> >>
> >> You started with a completely empty CIB and the two (or three?)
> >> nodes needed 30 minutes to form a cluster?
> >
> > Yes, in fact I cleared out both /var/lib/pacemaker/cib and
> > /var/lib/pacemaker/pengine several times, and most of the time after
> > starting pacemaker again one node would become "online" pretty
> > quickly (less than 5 minutes), but the other two would remain
> > "pending" for quite some time. I left it going overnight and this
> > morning all of the nodes
>
> That doesn't sound correct ... any logs during this time?

Yes, here's an example from today (the pending problem is a bit better now):

corosync.conf - http://pastebin.com/E8vVz4ME
output of corosync-cmapctl - http://pastebin.com/NwPHPbWb
output of corosync-cfgtool -s - http://pastebin.com/G2dvm7aP
crm configure show - http://pastebin.com/VDZAhyun
crm_mon -1 - http://pastebin.com/u6DWSFG0
log from DC (vcsquorum) - http://pastebin.com/hd7QRNdK
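(As an aside, and mostly as a note for future reference: here is my rough
understanding of the two recovery options discussed above, as a sketch. The
node name is just a placeholder, and I'd appreciate a correction if I've
misunderstood either step.

    # preferred: clear only the failed node's recorded state, which is
    # what Andreas suggested above
    crm node clearstate <failed-node>

    # last resort: with pacemaker stopped on every node first, remove the
    # saved CIB and policy engine files, then start pacemaker again
    rm -f /var/lib/pacemaker/cib/* /var/lib/pacemaker/pengine/*

Clearing just the failed node's state is obviously far less invasive than
wiping the saved files, so I'll try that first next time.)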
The two "real" nodes are vcs0 and vcs1 (vcsquorum is a quorum node in standby). I have defined the IP address to bind to in the bindnetaddr field in corosync.conf instead of just specifying the network address because I've found that often times corosync will bind to the localhost interface (this time on vcs1), as you can see in the crm_mon output above. I have also commented out the 127.0.1.1 line in /etc/hosts: 127.0.0.1 localhost #127.0.1.1 vcs1 I can remove the localhost node, but it keeps coming back. Is there something wrong with my configuration that causes it to appear? I can't see any reason why the DRBD resource would not start and promote one of the nodes to master. vcs0 is online and UpToDate: Role: Secondary/Unknown Disk State: UpToDate/DUnknown Connection State: WFConnection Any ideas on why the cluster is stuck in this state, with the DRBD service only started on vcs0? I have removed the system startup script for DRBD, so it is solely controlled by Pacemaker now. Thanks, Andrew > >> > >>> > >>> I am going to look into making the power source for the STONITH > >>> device independent of the power source for the node itself, > >>> however even with that setup there's still a chance that > >>> something > >>> could take out both power sources at the same time, in which case > >>> manual intervention and confirmation that the node is dead would > >>> be required. > >> > >> Pacemaker 1.1.8 supports (again) stonith topologies ... so more > >> than > >> one > >> fencing device and they can be "logically" combined. > > > > Where can I find documentation on STONITH topologies and > > configuring > > more than one fencing device for a single node? I don't see it > > mentioned > > in the Cluster Labs documentation (Clusters from Scratch or > > Pacemaker Explained). > > hmm ... good question ... beside the source code that includes an > example as comment .... > > Best regards, > Andreas > > > > > Thanks, > > > > Andrew > > > >> > >> Regards, > >> Andreas > >> > >> -- > >> Need help with Pacemaker? > >> http://www.hastexo.com/now > >> > >>> > >>> Thanks, > >>> > >>> Andrew > >>> > >>> _______________________________________________ > >>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker > >>> > >>> Project Home: http://www.clusterlabs.org > >>> Getting started: > >>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > >>> Bugs: http://bugs.clusterlabs.org > >>> > >> > >> > >> > >> > >> > >> > >> _______________________________________________ > >> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker > >> > >> Project Home: http://www.clusterlabs.org > >> Getting started: > >> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > >> Bugs: http://bugs.clusterlabs.org > >> > > > > _______________________________________________ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > Project Home: http://www.clusterlabs.org > > Getting started: > > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > > > > -- > Need help with Pacemaker? 
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org