Re: [Pacemaker] getting started - crm hangs when adding resources, even "crm ra classes" hangs

Andreas Kurz Wed, 14 Mar 2012 15:10:33 -0700

On 03/14/2012 05:55 PM, Phillip Frost wrote:
> On Mar 14, 2012, at 12:33 PM, Florian Haas wrote:
> 
>>> However, sometimes pacemakerd will not stop cleanly.
>>
>> OK. Whether this is related to your original problem or not is a
>> completely open question, jftr.
>>
>>> I thought it might happen when stopping pacemaker on the current DC, but 
>>> after successfully reproducing this failure twice, I couldn't do it again. 
>>> Pacemakerd seems to exit, but fail to notify the other nodes of its 
>>> shutdown. Syslog is flooded with "Retransmit List" messages (log attached). 
>>> These persist until I stop corosync. Asked immediately after stopping 
>>> pacemaker and corosync on one node, "crm status" on other nodes will report 
>>> that node as still online. After a while, the stopped node switches to 
>>> offline; I assume some timeout is expiring and they are assuming it crashed.
>>
>> You didn't give much other information, so I'm asking this on a hunch:
>> does your pacemaker service configuration stanza for corosync (either
>> in /etc/corosync/corosync.conf or in
>> /etc/corosync/service.d/pacemaker) say "ver: 0" or "ver: 1"?
> 
> I'm not sure if this is the same problem or not. I did experience a symptom 
> that looked to my inexperienced eyes very similar before I installed 
> 1.0.9+hg2665-1~bpo60+2 - that is, I'd try to stop pacemaker, and it wouldn't 
> stop, and I'd get that flood of retransmits in syslog.
> 
> To answer your question, I am using "ver: 1". It's worth mentioning that the 
> corosync.conf that comes with the packages in squeeze-backports has a service 
> block with ver: 0 in it, which took me some time to discover. However, I've 
> long ago removed it. Syslog seems to verify that ver: 1 is in effect:
> 
> Mar 14 12:02:34 xenhost02 pacemakerd: [7925]: info: get_config_opt: Found 
> 'pacemaker' for option: name
> Mar 14 12:02:34 xenhost02 pacemakerd: [7925]: info: get_config_opt: Found '1' 
> for option: ver
> 
> After playing with this system more, it seems this problem of "Retransmit 
> List" being flooded to syslog is not only on pacemakerd shutdown. For 
> example, I was just trying to add a DRBD resource, and crm got hung up at 
> "cib commit":
> 
> crm(drbd)# cib commit drbd
> [long pause, some minutes long]
> Could not commit shadow instance 'drbd' to the CIB: Remote node did not 
> respond
> ERROR: failed to commit the drbd shadow CIB
> 
> "corosync[7915]:   [TOTEM ] Retransmit List: b7 b8 b9" is being flooded to 
> syslog.
> 
> Every time I try to reproduce this, I can once or twice, but then no more. 
> I'm beginning to think that to set this up, a node has to have been running 
> for some time. I can reproduce it a few times because I try it on each node. 
> Then I have to restart corosync on each node to get things working again, and 
> after that, everything is fine, until I move on, spend some time reading 
> documentation, and try again.
> 
> I'm assuming these "Retransmit List" messages in syslog indicate that 
> corosync attempted to send a message to other nodes, did not receive 
> acknowledgement, and is thus attempting to resend them. I know corosync uses 
> IP multicast to communicate with the other nodes. Is it possible that my 
> network is doing something that breaks multicast connectivity? Multicast IP 
> isn't something I've ever had to deal with, so I'm not really sure. It's hard 
> to find anything that talks about configuring a network for multicast that 
> doesn't start talking about IP routers, which isn't relevant in my setup 
> because all the cluster nodes are on the same VLAN, on the same switch. Could 
> this be an issue? Is there a low-level utility (like ping) that I can use 
> to verify multicast IP connectivity between the nodes?
>
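On the question of a lower-level multicast check: a minimal sketch of a probe, assuming Python 3 is available on the nodes. The group address and port below are placeholders (not taken from your configuration); substitute the mcastaddr from your corosync.conf and a spare UDP port so the probe does not collide with corosync itself. The script name in the usage note is hypothetical too.

```python
import ipaddress
import socket
import struct
import sys
import time

GROUP = "226.94.1.1"  # placeholder: use the mcastaddr from corosync.conf
PORT = 5499           # placeholder: any spare UDP port, not corosync's own


def membership_request(group, iface="0.0.0.0"):
    """Build the ip_mreq structure passed to IP_ADD_MEMBERSHIP."""
    if not ipaddress.ip_address(group).is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))


def listen(group=GROUP, port=PORT):
    """Join the group and print every datagram that arrives."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    while True:
        data, (addr, _) = sock.recvfrom(1500)
        print(f"{len(data)} bytes from {addr}: {data!r}")


def send(group=GROUP, port=PORT, count=5):
    """Send a few numbered probes to the group, one per second."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # TTL 1 keeps the probes on the local segment (same VLAN, same switch).
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    for seq in range(count):
        sock.sendto(f"probe {seq}".encode(), (group, port))
        time.sleep(1)


if __name__ == "__main__" and len(sys.argv) == 2:
    if sys.argv[1] == "listen":
        listen()
    elif sys.argv[1] == "send":
        send()
```

Run "python3 mcast_probe.py listen" on one node and "python3 mcast_probe.py send" on another. If the listener prints nothing, or stops printing after the nodes have been up for a while, that would point at the switch or the network rather than at pacemaker.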


Besides testing broadcast or unicast (udpu) transport for corosync, have you
checked for MTU size problems? Corosync assumes an MTU of 1500 bytes by
default, as expected on standard Ethernet.
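A rough way to check the MTU between two nodes is to ping with the "don't fragment" bit set and the largest payload that still fits: 1500 bytes minus 20 for the IPv4 header and 8 for the ICMP header. The sketch below assumes Linux iputils ping (the -M do and -s options) and uses a hypothetical peer name, xenhost01; adjust both to your setup.

```python
import subprocess

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_icmp_payload(mtu=1500):
    """Largest ping payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(peer, mtu=1500):
    """Build a ping invocation that forbids fragmentation (-M do)."""
    return ["ping", "-c", "3", "-M", "do",
            "-s", str(max_icmp_payload(mtu)), peer]


def path_carries_mtu(peer, mtu=1500):
    """True if full-size, unfragmented pings reach the peer."""
    result = subprocess.run(ping_command(peer, mtu), capture_output=True)
    return result.returncode == 0


if __name__ == "__main__":
    # "xenhost01" is a placeholder peer name, not taken from the thread.
    print(" ".join(ping_command("xenhost01")))
```

If the full-size ping fails while a smaller one succeeds, the path MTU is below 1500, and lowering netmtu in the totem section of corosync.conf would be worth trying.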

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
