Hi,

If I take one cluster node (node01) out of the cluster for maintenance (/etc/init.d/openais stop) and reboot it, the node does not rejoin the cluster automatically. After the reboot, I see the following error and warning messages in the log:
Sep 3 07:34:09 node01 mgmtd: [9201]: ERROR: Can't initialize management library.Shutting down.(-1)
Sep 3 07:34:10 node01 corosync[8841]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process mgmtd exited (pid=9201, rc=1)
Sep 3 07:34:10 node01 corosync[8841]: [pcmk ] notice: pcmk_wait_dispatch: Respawning failed child process: mgmtd
Sep 3 07:34:10 node01 corosync[8841]: [pcmk ] info: spawn_child: Forked child 9202 for process mgmtd
Sep 3 07:34:10 node01 mgmtd: [9202]: info: Pacemaker-mgmt Hg Version: 0f1490eaa8d83db534385670fdcbd154407d51cc
Sep 3 07:34:10 node01 mgmtd: [9202]: info: G_main_add_SignalHandler: Added signal handler for signal 15
Sep 3 07:34:10 node01 mgmtd: [9202]: debug: Enabling coredumps
Sep 3 07:34:10 node01 mgmtd: [9202]: WARN: Core dumps could be lost if multiple dumps occur.
Sep 3 07:34:10 node01 mgmtd: [9202]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
Sep 3 07:34:10 node01 mgmtd: [9202]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Sep 3 07:34:10 node01 mgmtd: [9202]: info: G_main_add_SignalHandler: Added signal handler for signal 10
Sep 3 07:34:10 node01 mgmtd: [9202]: info: G_main_add_SignalHandler: Added signal handler for signal 12
Sep 3 07:34:10 node01 mgmtd: [9202]: info: init_crm: live
Sep 3 07:34:10 node01 mgmtd: [9202]: info: login to cib live: 0, ret:-10
Sep 3 07:34:11 node01 mgmtd: [9202]: info: login to cib live: 1, ret:-10
Sep 3 07:34:12 node01 mgmtd: [9202]: info: login to cib live: 2, ret:-10
Sep 3 07:34:13 node01 mgmtd: [9202]: info: login to cib live: 3, ret:-10
Sep 3 07:34:14 node01 mgmtd: [9202]: info: login to cib live: 4, ret:-10
Sep 3 07:34:15 node01 mgmtd: [9202]: info: login to cib failed: live
... ...
Sep 3 07:39:04 node01 corosync[8841]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process mgmtd exited (pid=9250, rc=1)
Sep 3 07:39:04 node01 corosync[8841]: [pcmk ] notice: pcmk_wait_dispatch: Respawning failed child process: mgmtd
Sep 3 07:39:04 node01 corosync[8841]: [pcmk ] info: spawn_child: Forked child 9251 for process mgmtd
Sep 3 07:39:04 node01 mgmtd: [9251]: info: Pacemaker-mgmt Hg Version: 0f1490eaa8d83db534385670fdcbd154407d51cc
Sep 3 07:39:04 node01 mgmtd: [9251]: info: G_main_add_SignalHandler: Added signal handler for signal 15
Sep 3 07:39:04 node01 mgmtd: [9251]: debug: Enabling coredumps
Sep 3 07:39:04 node01 mgmtd: [9251]: WARN: Core dumps could be lost if multiple dumps occur.
Sep 3 07:39:04 node01 mgmtd: [9251]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
Sep 3 07:39:04 node01 mgmtd: [9251]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Sep 3 07:39:04 node01 mgmtd: [9251]: info: G_main_add_SignalHandler: Added signal handler for signal 10
Sep 3 07:39:04 node01 mgmtd: [9251]: info: G_main_add_SignalHandler: Added signal handler for signal 12
Sep 3 07:39:04 node01 mgmtd: [9251]: info: init_crm: live
Sep 3 07:39:04 node01 mgmtd: [9251]: info: login to cib live: 0, ret:-10
Sep 3 07:39:05 node01 mgmtd: [9251]: info: login to cib live: 1, ret:-10
Sep 3 07:39:06 node01 mgmtd: [9251]: info: login to cib live: 2, ret:-10
Sep 3 07:39:07 node01 mgmtd: [9251]: info: login to cib live: 3, ret:-10
Sep 3 07:39:08 node01 mgmtd: [9251]: info: login to cib live: 4, ret:-10
Sep 3 07:39:09 node01 mgmtd: [9251]: info: login to cib failed: live
Sep 3 07:39:09 node01 mgmtd: [9251]: ERROR: Can't initialize management library.Shutting down.(-1)
Sep 3 07:39:10 node01 corosync[8841]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process mgmtd exited (pid=9251, rc=1)
Sep 3 07:39:10 node01 corosync[8841]: [pcmk ] ERROR: pcmk_wait_dispatch: Child respawn count exceeded by mgmtd
Sep 3 07:39:10 node01 corosync[8841]: [pcmk ] info: update_member: Node node01 now has process list: 00000000000000000000000000110312 (1114898)
Sep 3 07:39:10 node01 corosync[8841]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)

After exactly 20 minutes (at 07:39:10), the node stops trying to join the cluster.

Does anyone have any hints about this behaviour?

We're running corosync on a SLES 11 SP1 system.

# hb_report -V
cluster-glue: 1.0.5 (1448deafdf79754d12c36993d633bb2a68f82034)

Thanks a lot.
Tom