----- Original Message -----
> From: "Andrew Beekhof" <and...@beekhof.net>
> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> Sent: Tuesday, February 12, 2013 10:52:23 PM
> Subject: Re: [Pacemaker] Reason for cluster resource migration
>
> On Wed, Feb 13, 2013 at 2:04 AM, Andrew Martin <amar...@xes-inc.com> wrote:
> > ----- Original Message -----
> >> From: "Andrew Beekhof" <and...@beekhof.net>
> >> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> >> Sent: Monday, February 11, 2013 10:11:53 PM
> >> Subject: Re: [Pacemaker] Reason for cluster resource migration
> >>
> >> On Tue, Feb 12, 2013 at 3:07 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> >> > On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> >> >> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin <amar...@xes-inc.com> wrote:
> >> >>> Hello,
> >> >>>
> >> >>> Unfortunately this same failure occurred again tonight,
> >> >>
> >> >> It might be the same effect, but there was no indication that the PE died last time.
> >> >>
> >> >>> taking down a production cluster. Here is the part of the log where pengine died:
> >> >>> Feb 11 17:05:15 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=19357, core=128)
> >> >>> Feb 11 17:05:16 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Respawning failed child process: pengine
> >> >>> Feb 11 17:05:16 storage0 pengine[12660]: notice: crm_add_logfile: Additional logging available in /var/log/corosync.log
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: crm_ipc_read: Connection to pengine failed
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: mainloop_gio_callback: Connection to pengine[0x891680] closed (I/O condition=25)
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: crit: pe_ipc_destroy: Connection to the Policy Engine failed (pid=-1, uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: notice: terminate_cs_connection: Disconnecting from Corosync
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_exit: Could not recover from internal error
> >> >>>
> >> >>> The rest of the log:
> >> >>> http://sources.xes-inc.com/downloads/pengine.log
> >> >>> Looking through the full log, it seems that pengine recovers,
> >> >>
> >> >> Right, pacemakerd watches for this and restarts it.
> >> >>
> >> >>> but perhaps not quickly enough to prevent the STONITH and resource migration?
> >> >>
> >> >> Highly likely.
> >> >> However the PE crashing is quite serious. I'd like to get to the bottom of that ASAP.
> >> >>
> >> >>>
> >> >>> Here is the pe-core dump file mentioned in the log:
> >> >>> http://sources.xes-inc.com/downloads/pe-core.bz2
> >> >>
> >> >> Unfortunately core files are specific to the machine that generated them.
> >> >> If you create a crm_report for about that time, it will open it and record a backtrace for us to look at.
> >> >>
> >> >> Also very important is the contents of:
> >> >>
> >> >> /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
> >> >
> >> > Ohhh, thats what the pe-core link was.
> >> > I've run it through crm_simulate but couldn't reproduce the crash.
> >> >
> >> > So we'll still need the crm_report, it will have more detail on the
> >> > "Child process pengine terminated with signal 6 (pid=19357, core=128)"
> >> > part.
> >>
> >> Signal 6 is an assertion failure, but strangely there is no mention of one in syslog.
> >> Can you grep /var/log/corosync.log for lines containing 19357 please?
> >>
> > Andrew,
> >
> > Thanks for the help. Here are the lines containing 19357:
> > http://sources.xes-inc.com/downloads/19357.log
> > cl_sysadmin_notify is a clone of a ocf:heartbeat:MailTo resource. Postfix
> > is installed and running, so I am not sure why these failures are occurring.
> >
> >> > The core file will likely be somewhere under /var/lib/pacemaker/cores
> > That directory doesn't exist on this server, and it doesn't appear to be in /var/crash either:
>
> It looks like /var/lib/heartbeat/cores/ on your system.
>
> > # ls /var/crash/ -ltr
> > total 67548
> > -rw-r----- 1 hacluster whoopsie 1293711 Feb 6 10:01 _usr_libexec_pacemaker_pengine.110.crash
> > ---------- 1 root whoopsie 67874816 Feb 11 17:07 _usr_libexec_pacemaker_lrmd.0.crash
> > In case they would be helpful, here are those two files:
> > http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_pengine.110.crash
> > http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_lrmd.0.crash
> >
> > Here is the crm_report from storage0 from this time period:
> > http://sources.xes-inc.com/downloads/pengine-report.tar.bz2
>
> Are you sure?
> The pengine crashed on "Feb 11 17:05:15" but the report appears to be
> from "Tue Feb 12 09:59:50 EST 2013" to "Tue Feb 12 10:30:10 EST 2013"
>
> There was one crash in there, but it was of the lrmd.
> Unfortunately it looks like the binaries and libraries have been stripped.
>
> Where did you get them from? Do you know how to install the -debug packages?
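
As an aside on the window mismatch noted above (crash at "Feb 11 17:05:15", report covering Feb 12): one way to avoid it is to derive the crm_report -f/-t values from the crash timestamp itself. The following is only a minimal sketch, not anything shipped with Pacemaker. The helper name crm_report_args, the 5-minute/25-minute padding, and the need to supply the year by hand (syslog lines omit it) are all assumptions made for illustration. It relies only on crm_report's -f, -t and -n options and a trailing destination argument.

```python
import shlex
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # same "YYYY-MM-DD HH:MM:SS" form used with -f/-t


def crm_report_args(crash_time, nodes, dest, before_min=5, after_min=25):
    """Build a crm_report argument list covering a window around crash_time.

    Hypothetical helper: the padding defaults are arbitrary illustration
    values, and crash_time must already include the year.
    """
    crash = datetime.strptime(crash_time, TIME_FORMAT)
    start = crash - timedelta(minutes=before_min)
    end = crash + timedelta(minutes=after_min)
    return [
        "crm_report",
        "-f", start.strftime(TIME_FORMAT),
        "-t", end.strftime(TIME_FORMAT),
        "-n", " ".join(nodes),  # node names passed as one space-separated value
        dest,
    ]


if __name__ == "__main__":
    args = crm_report_args(
        "2013-02-11 17:05:15",
        ["storage0", "storage1", "storagequorum"],
        "/tmp/report",
    )
    # Print a shell-quoted command line for review; nothing is executed here.
    print(shlex.join(args))
```

The sketch only prints the command so it can be checked against the log timestamps before anyone runs it on a cluster node.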
Andrew,

I ran crm_report again as follows:
# crm_report -f "2013-02-11 17:00:00" -t "2013-02-11 17:30:00" \
    -n "storage0 storage1 storagequorum" -C /tmp/report
...
storage0: Collecting data from storage0 storage1 storagequorum (02/11/2013 05:00:00 PM to 02/11/2013 05:30:00 PM)
...
storage1: Found core file: -rw-r----- 1 root root 18485248 Feb 11 17:10 /var/lib/heartbeat/cores/root/core.7678

Here is the report it generated:
http://sources.xes-inc.com/downloads/storage-report.bz2

I created these packages with checkinstall (using the normal Pacemaker build process, but substituting checkinstall for "make install"). By default it strips debugging information when generating the package, which I thought was desirable for a production environment. I also have a debug version of the package, which I will install now. I am also working on building Ubuntu packages more officially with dpkg-buildpackage. Is there a better way to create these packages? I would prefer not to have to install build tools and compile the source directly on production servers.

Thanks,

Andrew

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org