Hello,

Unfortunately this same failure occurred again tonight, taking down a production cluster. Here is the part of the log where pengine died:

Feb 11 17:05:15 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=19357, core=128)
Feb 11 17:05:16 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Respawning failed child process: pengine
Feb 11 17:05:16 storage0 pengine[12660]: notice: crm_add_logfile: Additional logging available in /var/log/corosync.log
Feb 11 17:05:16 storage0 crmd[19358]: error: crm_ipc_read: Connection to pengine failed
Feb 11 17:05:16 storage0 crmd[19358]: error: mainloop_gio_callback: Connection to pengine[0x891680] closed (I/O condition=25)
Feb 11 17:05:16 storage0 crmd[19358]: crit: pe_ipc_destroy: Connection to the Policy Engine failed (pid=-1, uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
Feb 11 17:05:16 storage0 crmd[19358]: notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
Feb 11 17:05:16 storage0 crmd[19358]: warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
Feb 11 17:05:16 storage0 crmd[19358]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
Feb 11 17:05:16 storage0 crmd[19358]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Feb 11 17:05:16 storage0 crmd[19358]: notice: terminate_cs_connection: Disconnecting from Corosync
Feb 11 17:05:16 storage0 crmd[19358]: error: do_exit: Could not recover from internal error
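Signal 6 is SIGABRT, so presumably pengine hit a failed assertion. In case a backtrace is more useful than the raw files below, this is roughly how I would pull one from the core (a sketch, not verified here: the binary path assumes the Ubuntu/Debian pacemaker packages, the core file name depends on the kernel's core_pattern, and I am assuming Pacemaker's usual /var/lib/pacemaker/cores directory):

    # Load the core from the crash above (the pengine pid was 19357);
    # adjust both paths to match your install and core_pattern.
    gdb /usr/lib/pacemaker/pengine /var/lib/pacemaker/cores/core.19357

    # Then, at the (gdb) prompt, dump a full backtrace of every thread:
    (gdb) thread apply all bt full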
The rest of the log: http://sources.xes-inc.com/downloads/pengine.log

Looking through the full log, it seems that pengine recovers, but perhaps not quickly enough to prevent the STONITH and resource migration?

Here is the pe-core dump file mentioned in the log: http://sources.xes-inc.com/downloads/pe-core.bz2 (see the P.S. at the bottom of this mail for how I have been replaying these files locally)

Thanks,

Andrew

----- Original Message -----
> From: "Andrew Martin" <amar...@xes-inc.com>
> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> Sent: Friday, February 1, 2013 4:32:26 PM
> Subject: Re: [Pacemaker] Reason for cluster resource migration
>
> ----- Original Message -----
> > From: "Andrew Beekhof" <and...@beekhof.net>
> > To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> > Sent: Thursday, December 6, 2012 8:36:27 PM
> > Subject: Re: [Pacemaker] Reason for cluster resource migration
> >
> > On Wed, Dec 5, 2012 at 8:29 AM, Andrew Martin <amar...@xes-inc.com> wrote:
> > > Hello,
> > >
> > > I am running a 3-node Pacemaker cluster (2 "real" nodes and 1 quorum
> > > node in standby) on Ubuntu 12.04 server (amd64) with Pacemaker 1.1.8
> > > and Corosync 2.1.0. My cluster configuration is:
> > > http://pastebin.com/6TPkWtbt
> > >
> > > Recently, pengine died on storage0 (where the resources were running),
> > > which also happened to be the DC at the time. Consequently, Pacemaker
> > > went into recovery mode and released its role as DC, at which point
> > > storage1 took over the DC role and migrated the resources away from
> > > storage0 and onto storage1. Looking through the logs, it seems like
> > > storage0 came back into the cluster before the migration of the
> > > resources began:
> > >
> > > Dec 03 08:31:20 [3165] storage1 crmd: info: peer_update_callback: Client storage0/peer now has status [online] (DC=true)
> > > ...
> > > Dec 03 08:31:20 [3164] storage1 pengine: notice: LogActions: Start rscXXX (storage1)
> > >
> > > Thus, why did the migration occur, rather than aborting and having the
> > > resources simply remain running on storage0? Here are the logs from
> > > each of the nodes:
> > > storage0: http://pastebin.com/ZqqnH9uf
> > > storage1: http://pastebin.com/rvSLVcZs
> >
> > Hmm, that's an interesting one.
> > Can you provide this file? It will hold the answer:
> >
> > Dec 03 08:31:31 [3164] storage1 pengine: notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-28.bz2
> >
> > > Thanks,
> > >
> > > Andrew
>
> Andrew,
>
> Sorry for the delayed response. Here is the file you requested:
> http://sources.xes-inc.com/downloads/pe-input-28.bz2
>
> This same condition just occurred again on storage1 today (pengine
> died, and then storage1 was STONITHed).
>
> Thanks,
>
> Andrew
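P.S. For anyone who wants to dig into the pe-input/pe-core files linked above, this is roughly how I have been replaying them locally (a sketch: crm_simulate ships with Pacemaker 1.1.8 and reads the bzipped files directly, and the file names are just the ones from this thread):

    # Replay the transition the policy engine calculated from this input
    # and print the actions plus the resulting cluster state:
    crm_simulate -S -x pe-input-28.bz2

    # The same, but also print the allocation scores, which should show
    # why the resources were started on storage1 rather than left on storage0:
    crm_simulate -s -S -x pe-input-28.bz2

If the crash is in the policy engine itself, feeding it the pe-core file the same way may even reproduce the abort outside the cluster.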