On Mar 14, 2012, at 9:45 AM, Florian Haas wrote:

>>> The current cluster-glue package in squeeze-backports,
>>> cluster-glue_1.0.9+hg2665-1~bpo60+2, has upstart disabled.
>>> Double-check that you're running that version. If you do, and the
>>> issue persists, please let us know.
>>
>> Indeed, that's the version that hit the repo last night when I decided to
>> quit. This morning, I tried that version and concluded I was experiencing
>> the same issue.
>
> Are you absolutely certain?
>
> Can you confirm that you're running the ~bpo60+2 (note trailing "2")
> build, that you're actually running an lrmd binary from that version
> (meaning: that you properly killed your lrmd prior to installing that
> package), _and_ that "lrmadmin -C" does *not* list "upstart"?
Let's discard all of my previous conclusions. Apparently I was confused.

Now I'm sure I'm running +2 on all three nodes, and I restarted pacemaker and corosync on all of them. I'm basing my knowledge of which versions I'm running on apt-cache policy (output copied below). From that, I'm also reasonably sure that the patched versions of cluster-glue and glib I built are no longer installed.

I can confirm that lrmadmin -C does not list upstart (also below). Nor does it leak sockets, as reported by "lsof -f | grep lrm_callback_sock".

However, sometimes pacemakerd will not stop cleanly. I thought it might happen when stopping pacemaker on the current DC, but after successfully reproducing this failure twice, I couldn't do it again. Pacemakerd seems to exit, but fails to notify the other nodes of its shutdown. Syslog is flooded with "Retransmit List" messages (log attached). These persist until I stop corosync. If I run "crm status" on the other nodes immediately after stopping pacemaker and corosync on one node, they report that node as still online. After a while, the stopped node switches to offline; I assume some timeout expires and the other nodes conclude it crashed.
# lrmadmin -C
There are 4 RA classes supported:
lsb
ocf
heartbeat
stonith

# apt-cache policy pacemaker corosync cluster-glue libglib2.0-0
libglib2.0-0:
  Installed: 2.24.2-1
  Candidate: 2.24.2-1
  Version table:
 *** 2.24.2-1 0
        500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
        100 /var/lib/dpkg/status
cluster-glue:
  Installed: 1.0.9+hg2665-1~bpo60+2
  Candidate: 1.0.9+hg2665-1~bpo60+2
  Package pin: 1.0.9+hg2665-1~bpo60+2
  Version table:
 *** 1.0.9+hg2665-1~bpo60+2 1000
        100 http://backports.debian.org/debian-backports/ squeeze-backports/main amd64 Packages
        100 /var/lib/dpkg/status
     1.0.6-1 1000
        500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
corosync:
  Installed: 1.4.2-1~bpo60+1
  Candidate: 1.4.2-1~bpo60+1
  Package pin: 1.4.2-1~bpo60+1
  Version table:
 *** 1.4.2-1~bpo60+1 1000
        100 http://backports.debian.org/debian-backports/ squeeze-backports/main amd64 Packages
        100 /var/lib/dpkg/status
     1.2.1-4 1000
        500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
pacemaker:
  Installed: 1.1.6-2~bpo60+1
  Candidate: 1.1.6-2~bpo60+1
  Package pin: 1.1.6-2~bpo60+1
  Version table:
 *** 1.1.6-2~bpo60+1 1000
        100 http://backports.debian.org/debian-backports/ squeeze-backports/main amd64 Packages
        100 /var/lib/dpkg/status
     1.0.9.1+hg15626-1 1000
        500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
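
In case anyone wants to repeat the same check on several nodes without eyeballing the output, here is a rough sketch in Python. It only parses text that has already been captured from "apt-cache policy" and "lrmadmin -C"; the function names and the sample text are mine, made up for illustration, and the parsing assumes the output layout shown above rather than any documented format.

```python
import re

def installed_versions(policy_text):
    """Map each package in captured `apt-cache policy` text to its Installed version.

    Assumes the layout seen above: a "name:" header at column 0, followed by
    an indented "Installed: <version>" line.
    """
    versions = {}
    package = None
    for line in policy_text.splitlines():
        header = re.match(r"^(\S+):\s*$", line)
        if header:
            package = header.group(1)
            continue
        field = re.match(r"^\s+Installed:\s*(\S+)", line)
        if field and package is not None:
            versions[package] = field.group(1)
    return versions

def ra_classes(lrmadmin_text):
    """Return the class names that follow the first ':' in captured `lrmadmin -C` text."""
    return lrmadmin_text.partition(":")[2].split()

# Hypothetical sample, trimmed from the output in this mail.
SAMPLE_POLICY = """\
cluster-glue:
  Installed: 1.0.9+hg2665-1~bpo60+2
  Candidate: 1.0.9+hg2665-1~bpo60+2
  Version table:
 *** 1.0.9+hg2665-1~bpo60+2 1000
        100 /var/lib/dpkg/status
corosync:
  Installed: 1.4.2-1~bpo60+1
  Candidate: 1.4.2-1~bpo60+1
"""

SAMPLE_LRMADMIN = """\
There are 4 RA classes supported:
lsb
ocf
heartbeat
stonith
"""

if __name__ == "__main__":
    versions = installed_versions(SAMPLE_POLICY)
    print("cluster-glue is +2 build:", versions["cluster-glue"].endswith("~bpo60+2"))
    print("upstart listed:", "upstart" in ra_classes(SAMPLE_LRMADMIN))
```

To use it for real, the two sample strings would be replaced by the captured output from each node. Note that this only tells you what is installed, not whether the running lrmd was actually restarted from that package.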
pacemaker_shutdown.log.gz
Description: GNU Zip compressed data
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org