On Fri, Nov 19, 2010 at 12:08 AM, Dave Williams
<d...@opensourcesolutions.co.uk> wrote:
> I have a problem with a cluster that won't start up. It is running a 2 node
> failover (master slave) clustered ftp server using drbd to duplicate the
> filesystem.
>
> Upgraded from 10.04 Lucid to 10.10 Maverick to obtain support for
> upstart resource agents.
> Running:
> pacemaker 1.0.9.1-2ubuntu4
> corosync 1.2.1-1ubuntu1
> cluster-agents 1:1.0.3-3
>
> Before the upgrade it was working reasonably OK (except for detecting
> vsftpd running which I diagnosed as being due to upstart having hijacked
> the lsb compliant sysv startup script and replaced it with its own
> non-compliant version).
>
> daemon logs show:
>
> crmd: WARN: lrm_signon: can not initiate connection
> crmd: [4963]: WARN: do_lrm_control: Failed to sign on to the LRM 29 (30 max) time
>
> netstat -anp shows:
> unix 2 [ ] DGRAM 22204 4546/lrmd
>
> which implies at least part of lrmd is running.
> I don't know what this implies but I cannot find any unix sockets in the
> filing system
>
> ps axf shows:
> 25525 ? Ssl 0:00 /usr/sbin/corosync
> 25532 ? SLs 0:00 \_ /usr/lib/heartbeat/stonithd
> 25533 ? S 0:00 \_ /usr/lib/heartbeat/cib
> 25534 ? Z 0:00 \_ [lrmd] <defunct>
> 25535 ? S 0:00 \_ /usr/lib/heartbeat/attrd
> 25536 ? Z 0:00 \_ [pengine] <defunct>
> 25537 ? S 0:00 \_ /usr/lib/heartbeat/crmd
> 25540 ? S 0:00 \_ /usr/lib/heartbeat/cib
> 25541 ? S 0:00 \_ /usr/lib/heartbeat/lrmd
> 25542 ? S 0:00 \_ /usr/lib/heartbeat/attrd
> 25543 ? S 0:00 \_ /usr/lib/heartbeat/pengine
> 25547 ? Z 0:00 \_ [corosync] <defunct>
> 25548 ? Z 0:00 \_ [corosync] <defunct>
> 25553 ? Z 0:00 \_ [corosync] <defunct>
> 25555 ? Z 0:00 \_ [corosync] <defunct>
> 25866 ? S 0:00 \_ /usr/lib/heartbeat/crmd
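
To put numbers on that listing, here is a rough Python sketch that tallies live and defunct instances per daemon from `ps axf` text. The function names (`summarize`, `duplicated`) and the `DAEMONS` tuple are made up for illustration; the sample is the listing quoted above, and the parsing assumes the five-column PID/TTY/STAT/TIME/COMMAND layout shown there.

```python
from collections import Counter

# Daemon names as they appear in the quoted ps listing.
DAEMONS = ("corosync", "stonithd", "cib", "lrmd", "attrd", "pengine", "crmd")

# The ps axf output quoted above, with the tree indentation flattened.
SAMPLE = r"""
25525 ? Ssl 0:00 /usr/sbin/corosync
25532 ? SLs 0:00 \_ /usr/lib/heartbeat/stonithd
25533 ? S 0:00 \_ /usr/lib/heartbeat/cib
25534 ? Z 0:00 \_ [lrmd] <defunct>
25535 ? S 0:00 \_ /usr/lib/heartbeat/attrd
25536 ? Z 0:00 \_ [pengine] <defunct>
25537 ? S 0:00 \_ /usr/lib/heartbeat/crmd
25540 ? S 0:00 \_ /usr/lib/heartbeat/cib
25541 ? S 0:00 \_ /usr/lib/heartbeat/lrmd
25542 ? S 0:00 \_ /usr/lib/heartbeat/attrd
25543 ? S 0:00 \_ /usr/lib/heartbeat/pengine
25547 ? Z 0:00 \_ [corosync] <defunct>
25548 ? Z 0:00 \_ [corosync] <defunct>
25553 ? Z 0:00 \_ [corosync] <defunct>
25555 ? Z 0:00 \_ [corosync] <defunct>
25866 ? S 0:00 \_ /usr/lib/heartbeat/crmd
"""


def summarize(ps_text):
    """Return (live, defunct) Counters keyed by daemon name."""
    live, defunct = Counter(), Counter()
    for line in ps_text.splitlines():
        fields = line.split(None, 4)  # PID TTY STAT TIME COMMAND
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip blank lines and headers
        stat, command = fields[2], fields[4]
        # Drop the "\_" tree art, then keep only the executable token.
        tokens = command.replace("\\_", " ").split()
        if not tokens:
            continue
        # "[lrmd]" and "/usr/lib/heartbeat/lrmd" both reduce to "lrmd".
        name = tokens[0].strip("[]").rsplit("/", 1)[-1]
        if name not in DAEMONS:
            continue
        # A STAT beginning with "Z" marks a zombie (defunct) process.
        if stat.startswith("Z"):
            defunct[name] += 1
        else:
            live[name] += 1
    return live, defunct


def duplicated(live):
    """Names of daemons with more than one live instance."""
    return sorted(name for name, count in live.items() if count > 1)


if __name__ == "__main__":
    live, defunct = summarize(SAMPLE)
    print("live:", dict(live))
    print("defunct:", dict(defunct))
    print("duplicated:", duplicated(live))
```

On a healthy node I would expect one live instance of each daemon and nothing defunct, so anything `duplicated` returns is worth a closer look.
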

This install is seriously sick: there are multiple copies of all our daemons.
If I had to guess, I'd say there were version incompatibilities between the
various cluster packages.

> (This was from another run so the pids differ from above).
>
> crm_mon -1 shows:
>
> ============
> Last updated: Wed Nov 17 00:13:25 2010
> Stack: openais
> Current DC: NONE
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> OFFLINE: [ node1 node2 ]
>
> Clearly the Current DC:NONE is the symptom that results from lrmd not
> being communicative
>
> strace analysis shows initial (defunct) lrmd creating
> "/var/run/heartbeat/lrm_cmd_sock" and ..callback_sock
> then being terminated via a SIGTERM kill about 1 second later by the 2nd lrmd
> instance that continues running. This appears to cause the first
> instance to delete the socket.
>
> I haven't followed the src enough yet to understand whether this is expected
> or an erroneous condition but it appears the missing socket is the cause
> of the error messages. Whether this is why my cluster won't start I am
> not 100% sure.
>
> It may be some form of timing condition because I did manage to get the
> stack running once via corosync stops and starts with a random delay in
> between.
> (I note that "/etc/init.d/corosync stop" leaves some processes running!)
>
> Can anyone help me debug and find root cause and a solution?
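
To check the missing-socket theory without strace, a minimal Python sketch like this reports whether a path exists and really is a unix domain socket. The helper name `is_unix_socket` is made up for illustration; the only path used is the `lrm_cmd_sock` one quoted from the strace output, and I am assuming the script runs on the affected node with permission to stat that directory.

```python
import os
import stat

# Path quoted from the strace output above.
LRM_CMD_SOCK = "/var/run/heartbeat/lrm_cmd_sock"


def is_unix_socket(path):
    """Return True only if `path` exists and is a unix domain socket."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path, or no permission to stat it.
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    state = "present" if is_unix_socket(LRM_CMD_SOCK) else "missing"
    print(LRM_CMD_SOCK, state)
```

Running it in a loop right after starting corosync should show whether the socket appears and then vanishes, which would match the create-then-SIGTERM sequence described above.
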
>
> Thanks

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker