----- Original Message -----
> From: "Andrew Beekhof" <and...@beekhof.net>
> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> Sent: Monday, February 11, 2013 10:11:53 PM
> Subject: Re: [Pacemaker] Reason for cluster resource migration
>
> On Tue, Feb 12, 2013 at 3:07 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> > On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> >> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin <amar...@xes-inc.com> wrote:
> >>> Hello,
> >>>
> >>> Unfortunately this same failure occurred again tonight,
> >>
> >> It might be the same effect, but there was no indication that the PE
> >> died last time.
> >>
> >>> taking down a production cluster. Here is the part of the log
> >>> where pengine died:
> >>> Feb 11 17:05:15 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=19357, core=128)
> >>> Feb 11 17:05:16 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Respawning failed child process: pengine
> >>> Feb 11 17:05:16 storage0 pengine[12660]: notice: crm_add_logfile: Additional logging available in /var/log/corosync.log
> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: crm_ipc_read: Connection to pengine failed
> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: mainloop_gio_callback: Connection to pengine[0x891680] closed (I/O condition=25)
> >>> Feb 11 17:05:16 storage0 crmd[19358]: crit: pe_ipc_destroy: Connection to the Policy Engine failed (pid=-1, uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
> >>> Feb 11 17:05:16 storage0 crmd[19358]: notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
> >>> Feb 11 17:05:16 storage0 crmd[19358]: warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
> >>> Feb 11 17:05:16 storage0 crmd[19358]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> >>> Feb 11 17:05:16 storage0 crmd[19358]: notice: terminate_cs_connection: Disconnecting from Corosync
> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_exit: Could not recover from internal error
> >>>
> >>> The rest of the log:
> >>> http://sources.xes-inc.com/downloads/pengine.log
> >>> Looking through the full log, it seems that pengine recovers,
> >>
> >> Right, pacemakerd watches for this and restarts it.
> >>
> >>> but perhaps not quickly enough to prevent the STONITH and
> >>> resource migration?
> >>
> >> Highly likely.
> >> However the PE crashing is quite serious. I'd like to get to the
> >> bottom of that ASAP.
> >>
> >>>
> >>> Here is the pe-core dump file mentioned in the log:
> >>> http://sources.xes-inc.com/downloads/pe-core.bz2
> >>
> >> Unfortunately core files are specific to the machine that
> >> generated them.
> >> If you create a crm_report for about that time, it will open it and
> >> record a backtrace for us to look at.
> >>
> >> Also very important is the contents of:
> >>
> >> /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
> >
> > Ohhh, thats what the pe-core link was.
> > I've run it through crm_simulate but couldn't reproduce the crash.
> >
> > So we'll still need the crm_report, it will have more detail on the
> > "Child process pengine terminated with signal 6 (pid=19357, core=128)"
> > part.
>
> Signal 6 is an assertion failure, but strangely there is no mention of
> one in syslog.
> Can you grep /var/log/corosync.log for lines containing 19357 please?
>

Andrew,

Thanks for the help. Here are the lines containing 19357:
http://sources.xes-inc.com/downloads/19357.log

cl_sysadmin_notify is a clone of an ocf:heartbeat:MailTo resource. Postfix is
installed and running, so I am not sure why these failures are occurring.

> > The core file will likely be somewhere under
> > /var/lib/pacemaker/cores

That directory doesn't exist on this server, and it doesn't appear to be in
/var/crash either:
# ls /var/crash/ -ltr
total 67548
-rw-r----- 1 hacluster whoopsie  1293711 Feb  6 10:01 _usr_libexec_pacemaker_pengine.110.crash
---------- 1 root      whoopsie 67874816 Feb 11 17:07 _usr_libexec_pacemaker_lrmd.0.crash

In case they would be helpful, here are those two files:
http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_pengine.110.crash
http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_lrmd.0.crash

Here is the crm_report from storage0 from this time period:
http://sources.xes-inc.com/downloads/pengine-report.tar.bz2

Thanks,

Andrew

> > but crm_report should be able to find it.
> >
> >>
> >>>
> >>> Thanks,
> >>>
> >>> Andrew
> >>>
> >>> ----- Original Message -----
> >>>> From: "Andrew Martin" <amar...@xes-inc.com>
> >>>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> >>>> Sent: Friday, February 1, 2013 4:32:26 PM
> >>>> Subject: Re: [Pacemaker] Reason for cluster resource migration
> >>>>
> >>>> ----- Original Message -----
> >>>> > From: "Andrew Beekhof" <and...@beekhof.net>
> >>>> > To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> >>>> > Sent: Thursday, December 6, 2012 8:36:27 PM
> >>>> > Subject: Re: [Pacemaker] Reason for cluster resource migration
> >>>> >
> >>>> > On Wed, Dec 5, 2012 at 8:29 AM, Andrew Martin <amar...@xes-inc.com> wrote:
> >>>> > > Hello,
> >>>> > >
> >>>> > > I am running a 3-node Pacemaker cluster (2 "real" nodes and 1 quorum node in
> >>>> > > standby) on Ubuntu 12.04 server (amd64) with Pacemaker 1.1.8 and Corosync
> >>>> > > 2.1.0. My cluster configuration is:
> >>>> > > http://pastebin.com/6TPkWtbt
> >>>> > >
> >>>> > > Recently, pengine died on storage0 (where the resources were running) which
> >>>> > > also happened to be the DC at the time. Consequently, Pacemaker went into
> >>>> > > recovery mode and released its role as DC, at which point storage1 took over
> >>>> > > the DC role and migrated the resources away from storage0 and onto storage1.
> >>>> > > Looking through the logs, it seems like storage0 came back into the cluster
> >>>> > > before the migration of the resources began:
> >>>> > > Dec 03 08:31:20 [3165] storage1 crmd: info: peer_update_callback: Client storage0/peer now has status [online] (DC=true)
> >>>> > > ...
> >>>> > > Dec 03 08:31:20 [3164] storage1 pengine: notice: LogActions: Start rscXXX (storage1)
> >>>> > >
> >>>> > > Thus, why did the migration occur, rather than aborting and having the
> >>>> > > resources simply remain running on storage0? Here are the logs from each of
> >>>> > > the nodes:
> >>>> > > storage0: http://pastebin.com/ZqqnH9uf
> >>>> > > storage1: http://pastebin.com/rvSLVcZs
> >>>> >
> >>>> > Hmm, thats an interesting one.
> >>>> > Can you provide this file? It will hold the answer:
> >>>> >
> >>>> > Dec 03 08:31:31 [3164] storage1 pengine: notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-28.bz2
> >>>> >
> >>>> > >
> >>>> > > Thanks,
> >>>> > >
> >>>> > > Andrew
> >>>> > >
> >>>> > > _______________________________________________
> >>>> > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> >>>> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>> > >
> >>>> > > Project Home: http://www.clusterlabs.org
> >>>> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>> > > Bugs: http://bugs.clusterlabs.org
> >>>> > >
> >>>>
> >>>> Andrew,
> >>>>
> >>>> Sorry for the delayed response. Here is the file you requested:
> >>>> http://sources.xes-inc.com/downloads/pe-input-28.bz2
> >>>>
> >>>> This same condition just occurred again on storage1 today (pengine
> >>>> died, and then storage1 was STONITHed).
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Andrew
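
For anyone repeating the "grep for 19357" step on their own logs, here is a minimal Python sketch of the same idea. It is not part of Pacemaker: the function names parse_child_exit and lines_for_pid are made up for illustration, and the pattern is only based on the single pcmk_child_exit line quoted in this thread, so other Pacemaker versions may word it differently. The signal number is mapped to a name with Python's standard signal module (6 is SIGABRT on Linux, which fits the "assertion failure" remark above). The core= value is kept as a raw number because the thread does not say how to interpret it.

```python
import re
import signal

# Pattern for the pacemakerd "pcmk_child_exit" line quoted in this thread.
# Assumption: other Pacemaker versions may format this message differently.
CHILD_EXIT = re.compile(
    r"Child process (?P<name>\S+) terminated with signal (?P<signal>\d+) "
    r"\(pid=(?P<pid>\d+), core=(?P<core>\d+)\)"
)


def parse_child_exit(line):
    """Return the fields of a child-exit log line as a dict, or None if no match."""
    match = CHILD_EXIT.search(line)
    if match is None:
        return None
    signum = int(match.group("signal"))
    try:
        # Platform-dependent lookup; 6 maps to SIGABRT on Linux.
        signame = signal.Signals(signum).name
    except ValueError:
        signame = "UNKNOWN"
    return {
        "name": match.group("name"),
        "pid": int(match.group("pid")),
        "signal": signum,
        "signal_name": signame,
        "core": int(match.group("core")),  # raw value, not interpreted here
    }


def lines_for_pid(lines, pid):
    """Keep lines mentioning pid as a whole number, not as part of a longer one."""
    wanted = re.compile(r"(?<!\d)%d(?!\d)" % pid)
    return [line for line in lines if wanted.search(line)]


if __name__ == "__main__":
    # Hypothetical usage: adjust the path to wherever your cluster logs live.
    with open("/var/log/corosync.log", errors="replace") as log:
        for hit in lines_for_pid(log, 19357):
            print(hit, end="")
```

Compared with a plain substring grep, the digit boundaries keep a search for 19357 from also matching a longer number such as 193570.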