I have a problem with a cluster that won't start up. It is a 2-node failover (master/slave) clustered FTP server using DRBD to replicate the filesystem.
I upgraded from 10.04 Lucid to 10.10 Maverick to obtain support for upstart resource agents. Running:

    pacemaker       1.0.9.1-2ubuntu4
    corosync        1.2.1-1ubuntu1
    cluster-agents  1:1.0.3-3

Before the upgrade it was working reasonably well (except for detecting that vsftpd was running, which I diagnosed as being due to upstart having hijacked the LSB-compliant SysV startup script and replaced it with its own non-compliant version).

The daemon logs show:

    crmd: WARN: lrm_signon: can not initiate connection
    crmd: [4963]: WARN: do_lrm_control: Failed to sign on to the LRM 29 (30 max) times

netstat -anp shows:

    unix  2      [ ]         DGRAM                    22204    4546/lrmd

which implies at least part of lrmd is running. I don't know what this implies, but I cannot find any unix sockets in the filesystem.

ps axf shows:

    25525 ?        Ssl    0:00 /usr/sbin/corosync
    25532 ?        SLs    0:00  \_ /usr/lib/heartbeat/stonithd
    25533 ?        S      0:00  \_ /usr/lib/heartbeat/cib
    25534 ?        Z      0:00  \_ [lrmd] <defunct>
    25535 ?        S      0:00  \_ /usr/lib/heartbeat/attrd
    25536 ?        Z      0:00  \_ [pengine] <defunct>
    25537 ?        S      0:00  \_ /usr/lib/heartbeat/crmd
    25540 ?        S      0:00  \_ /usr/lib/heartbeat/cib
    25541 ?        S      0:00  \_ /usr/lib/heartbeat/lrmd
    25542 ?        S      0:00  \_ /usr/lib/heartbeat/attrd
    25543 ?        S      0:00  \_ /usr/lib/heartbeat/pengine
    25547 ?        Z      0:00  \_ [corosync] <defunct>
    25548 ?        Z      0:00  \_ [corosync] <defunct>
    25553 ?        Z      0:00  \_ [corosync] <defunct>
    25555 ?        Z      0:00  \_ [corosync] <defunct>
    25866 ?        S      0:00  \_ /usr/lib/heartbeat/crmd

(This was from another run, so the PIDs differ from the log lines above.)

crm_mon -1 shows:

    ============
    Last updated: Wed Nov 17 00:13:25 2010
    Stack: openais
    Current DC: NONE
    2 Nodes configured, 2 expected votes
    2 Resources configured.
    ============
    OFFLINE: [ node1 node2 ]

Clearly "Current DC: NONE" is the symptom that results from lrmd not being communicative.

strace analysis shows the initial (defunct) lrmd creating "/var/run/heartbeat/lrm_cmd_sock" and ..callback_sock, then being terminated via a SIGTERM kill about 1 second later by the second lrmd instance, which continues running.
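For reference, a sketch of the checks I ran (the socket path comes from the strace output above; the `ps`/`awk`/`strace` options are standard, but adapt paths to your build):

```shell
#!/bin/sh
# Check whether the lrmd control sockets actually exist on disk.
# (Path taken from my strace output; adjust if your build differs.)
ls -la /var/run/heartbeat/ || true

# List any zombie (defunct) processes -- a quick way to spot the dead
# lrmd/pengine/corosync children shown in the ps tree above.
# Column 3 is STAT; zombies report 'Z' as the first character.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/ {print}'

# Roughly how I captured the socket create/unlink and SIGTERM activity:
# -f follows forks, -tt timestamps each call, -e limits the trace to
# the relevant syscalls.  (Only meaningful on the cluster node itself.)
if [ -x /etc/init.d/corosync ]; then
    strace -f -tt -e trace=bind,unlink,kill -o /tmp/corosync.strace \
        /etc/init.d/corosync start
fi
```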
The SIGTERM appears to cause the first lrmd instance to delete the socket on exit. I haven't followed the source enough yet to understand whether this is expected or an erroneous condition, but it appears the missing socket is the cause of the error messages. Whether this is why my cluster won't start I am not 100% sure. It may be some form of timing condition, because I did once manage to get the stack running via corosync stops and starts with a random delay in between. (I note that "/etc/init.d/corosync stop" leaves some processes running!)

Can anyone help me debug this and find the root cause and a solution?

Thanks

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker