Re: [Pacemaker] getting started - crm hangs when adding resources, even "crm ra classes" hangs

Phillip Frost Wed, 14 Mar 2012 09:07:13 -0700

On Mar 14, 2012, at 9:45 AM, Florian Haas wrote:
>>> The current cluster-glue package in squeeze-backports,
>>> cluster-glue_1.0.9+hg2665-1~bpo60+2, has upstart disabled.
>>> Double-check that you're running that version. If you do, and the
>>> issue persists, please let us know.
>> 
>> Indeed, that's the version that hit the repo last night when I decided to 
>> quit. This morning, I tried that version and concluded I was experiencing 
>> the same issue.
> 
> Are you absolutely certain?
> 
> Can you confirm that you're running the ~bpo60+2 (note trailing "2")
> build, that you're actually running an lrmd binary from that version
> (meaning: that you properly killed your lrmd prior to installing that
> package), _and_ that "lrmadmin -C" does *not* list "upstart"?


Let's discard all of my previous conclusions. Apparently I was confused. 

Now, I'm sure I'm running +2 on all three nodes. And, I restarted pacemaker and 
corosync on all the nodes. I'm basing my knowledge of what versions I'm running 
on apt-cache policy, output copied below. From that, I'm also reasonably sure 
that whatever patched versions of cluster-glue and glib I built are not 
installed now.

I can confirm that lrmadmin -C does not list upstart (also below). Nor does it 
leak sockets, as reported by "lsof -f | grep lrm_callback_sock". However, 
sometimes pacemakerd will not stop cleanly. I thought it might happen when 
stopping pacemaker on the current DC, but after successfully reproducing this 
failure twice, I couldn't do it again. Pacemakerd seems to exit, but fails to 
notify the other nodes of its shutdown. Syslog is flooded with "Retransmit 
List" messages (log attached). These persist until I stop corosync. If I run 
"crm status" on the other nodes immediately after stopping pacemaker and 
corosync on one node, they still report that node as online. After a while, 
the stopped node switches to offline; I assume some timeout is expiring and 
they are assuming it crashed.

# lrmadmin -C
There are 4 RA classes supported:
lsb
ocf
heartbeat
stonith
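
In case it helps anyone repeating this check on several nodes, here is a minimal sketch of how the output above could be tested for a given class. The function name has_ra_class is made up for illustration; it only assumes that lrmadmin -C prints one class name per line, as it does above.

```shell
# Hypothetical helper: succeeds if the RA class named in $1 appears as a
# whole line in "lrmadmin -C" style output read from stdin.
has_ra_class() {
    # -x: match the entire line, -F: treat the name as a fixed string,
    # -q: no output, only the exit status matters.
    grep -qxF -- "$1"
}

# Usage on a live node (assumes lrmadmin is in PATH):
#   lrmadmin -C | has_ra_class upstart && echo "upstart is still listed"
```

Matching the whole line avoids a false positive from the "There are 4 RA classes supported:" header or from any class whose name merely contains the string.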

# apt-cache policy pacemaker corosync cluster-glue libglib2.0-0
libglib2.0-0:
 Installed: 2.24.2-1
 Candidate: 2.24.2-1
 Version table:
*** 2.24.2-1 0
       500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
       100 /var/lib/dpkg/status
cluster-glue:
 Installed: 1.0.9+hg2665-1~bpo60+2
 Candidate: 1.0.9+hg2665-1~bpo60+2
 Package pin: 1.0.9+hg2665-1~bpo60+2
 Version table:
*** 1.0.9+hg2665-1~bpo60+2 1000
       100 http://backports.debian.org/debian-backports/ squeeze-backports/main amd64 Packages
       100 /var/lib/dpkg/status
    1.0.6-1 1000
       500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
corosync:
 Installed: 1.4.2-1~bpo60+1
 Candidate: 1.4.2-1~bpo60+1
 Package pin: 1.4.2-1~bpo60+1
 Version table:
*** 1.4.2-1~bpo60+1 1000
       100 http://backports.debian.org/debian-backports/ squeeze-backports/main amd64 Packages
       100 /var/lib/dpkg/status
    1.2.1-4 1000
       500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
pacemaker:
 Installed: 1.1.6-2~bpo60+1
 Candidate: 1.1.6-2~bpo60+1
 Package pin: 1.1.6-2~bpo60+1
 Version table:
*** 1.1.6-2~bpo60+1 1000
       100 http://backports.debian.org/debian-backports/ squeeze-backports/main amd64 Packages
       100 /var/lib/dpkg/status
    1.0.9.1+hg15626-1 1000
       500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
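
To compare the three nodes without reading the full policy output each time, something like the following could reduce it to one line per package. This is only a sketch: installed_versions is a hypothetical name, and it assumes the layout shown above (an unindented "package:" line followed by an indented "Installed:" line).

```shell
# Hypothetical helper: reads "apt-cache policy" output on stdin and prints
# one "package installed-version" pair per line.
installed_versions() {
    awk '
        # An unindented line ending in ":" names the package, e.g. "cluster-glue:"
        /^[^ ].*:$/ { pkg = substr($0, 1, length($0) - 1) }
        # The indented "Installed:" line carries the version in field 2
        $1 == "Installed:" { print pkg, $2 }
    '
}

# Usage on each node (assumes apt-cache is available):
#   apt-cache policy pacemaker corosync cluster-glue | installed_versions
```

This only reports what dpkg believes is installed. It does not prove that the running lrmd binary comes from that package, so the daemons still need to be restarted after an upgrade.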

pacemaker_shutdown.log.gz
Description: GNU Zip compressed data

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
