On Wed, Dec 8, 2010 at 11:58 AM, Simon Jansen <simon.jans...@googlemail.com> wrote:
> Hi,
>
> I have set up a pacemaker cluster on Ubuntu 10.04 LTS Server.
> Further I wrote a multistate OCF RA for the Rsyslog service. This RA passes
> all tests that are run by the ocf-tester tool.
>
> Now the problem:
> When I first start the msSyslog resource it promotes on node1 and is fully
> functional. After that I set node1 to standby. The other node (node2) takes
> the master role. This behaviour is just as expected. Then I set node1 to
> online again to test if the failback works. There the error occurs: the crmd
> exits and starts again. These actions occur in an endless loop and I can
> just reboot both nodes several times to come in a functional state again.
> I attached a summary of the log file so that you can see what's happening
> exactly. In addition I attached the Rsyslog RA and the cluster config.
>
> Maybe someone has a clue why the crmd is restarting all the time after the
> failback. I think that there is an error in the Rsyslog RA because the
> cluster works fine when I stop the Rsyslog resource manually.
Here's the reason:

Dec 8 11:15:14 node1 crmd: [31284]: ERROR: send_ipc_message: IPC Channel to 31285 is not connected
Dec 8 11:15:14 node1 crmd: [31284]: ERROR: do_pe_invoke_callback: Could not contact the pengine
Dec 8 11:15:14 node1 crmd: [31284]: info: do_pe_invoke_callback: Invoking the PE: query=32, ref=pe_calc-dc-1291803314-10, seq=736, quorate=1
Dec 8 11:15:14 node1 crmd: [31284]: info: pe_msg_dispatch: Received HUP from pengine:[31285]
Dec 8 11:15:14 node1 crmd: [31284]: CRIT: pe_connection_destroy: Connection to the Policy Engine failed (pid=31285, uuid=2525f074-89f6-468e-8900-14d278808c31)
...
Dec 8 11:15:15 node1 corosync[898]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process pengine terminated with signal 11 (pid=31285, core=false)

The policy engine appears to be crashing and this is causing the crmd to restart as part of the recovery.
Perhaps file a bug with the Ubuntu guys to suck in a more recent version of pacemaker.
If it still occurs with 1.0.10, add "ulimit -c unlimited" to the openais init script to be sure that a core file is produced (so we can figure out where/why).

> --
>
> Regards,
>
> Simon Jansen
>
> ---------------------------
> Simon Jansen
> 64291 Darmstadt
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
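P.S. The diagnosis in this reply rests on the corosync line that reports pengine dying with signal 11 and core=false. If you want to check a longer log for the same pattern, something along these lines might help. It is only a rough sketch: the function name crashed_children is made up for illustration, and it assumes the "Child process ... terminated with signal ..." message format exactly as it appears in the excerpt in this thread.

```shell
#!/bin/sh
# Sketch: list child processes that corosync reported as killed by a signal.
# Assumes the pcmk_wait_dispatch message format seen in the log excerpt;
# crashed_children is a hypothetical helper, not part of pacemaker.
crashed_children() {
    # Reads log text on stdin and prints one summary line per match.
    sed -n 's/.*Child process \([^ ]*\) terminated with signal \([0-9]*\) (pid=\([0-9]*\), core=\([a-z]*\)).*/\1 signal=\2 pid=\3 core=\4/p'
}

# Usage with two lines taken from the excerpt; only the second one matches.
crashed_children <<'EOF'
Dec 8 11:15:14 node1 crmd: [31284]: info: pe_msg_dispatch: Received HUP from pengine:[31285]
Dec 8 11:15:15 node1 corosync[898]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process pengine terminated with signal 11 (pid=31285, core=false)
EOF
```

A core=false entry means no core file was written for that crash, which is why the "ulimit -c unlimited" change is suggested before reproducing the problem.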