[Pacemaker] corosync vs. pacemaker 1.1

Kiss Bence Wed, 25 Jan 2012 07:11:14 -0800

Hi,

I am new to clustering and am trying to build a two-node active/passive cluster based on the documentation: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

My systems are Fedora 14, up to date. After forming the cluster as described, I started to test it (resources: drbd -> lvm -> fs -> group of services). I moved resources around, and rebooted and killed nodes (first in a virtual environment, then also on real machines).

After some events the two nodes ended up in a kind of split-brain state. On both nodes, crm_mon showed the other node as offline, although the drbd subsystem showed everything in sync and working. The network was not the issue (ping, TCP and UDP communication were fine). Nothing had changed from the network's point of view.

At first the rejoining went quite well, but after some more events it took longer, and after still more events it did not happen at all. A network dump showed the multicast packets still coming and going. At the corosync level (crm_node -l), neither node listed the other. After I tried to configure the cluster, the cib logs were full of messages like "<the other node>: not in our membership".

I tried to erase the config (crm configure erase, cibadmin -E -f) but it worked only locally. I noticed that the pacemaker process did not start up normally on the node that booted after the other. I also tried removing files from /var/lib/pengine/ and /var/lib/heartbeat/crm/, but that only made the resources disappear. It did not help with forming a cluster, even one without resources. The pacemaker process exited some 20 minutes after it started. Starting it manually gave the same result.

After digging through Google for answers I found nothing helpful. Following tips I ran across, I changed the version in the /etc/corosync/service.d/pcmk file to 1.1 (this is the version of pacemaker in this distro). I realized that the cluster processes were then started by corosync itself, not by pacemaker, so starting pacemaker separately could be omitted. The cluster forming has been stable since this change, even after many, many events.
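
For reference, a minimal sketch of the stanza that file holds, assuming the layout used in the Clusters from Scratch guide. The comments reflect the documented meanings of ver 0 and ver 1 only; how the plugin treats any other value (such as 1.1) is not something this sketch claims.

```
# /etc/corosync/service.d/pcmk (sketch, assumed layout)
service {
        # Load the Pacemaker Cluster Resource Manager
        name: pacemaker
        # ver: 0 -> the corosync plugin also starts the pacemaker daemons
        # ver: 1 -> corosync only loads the plugin; pacemaker has to be
        #           started separately (its own init script)
        ver:  1
}
```

With ver: 1, the pacemaker service needs to be started on each node after corosync is up.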

Now I have reread the document mentioned above, and I wonder why it has the "Important notice" on page 37. What is theoretically wrong with my scenario? Why does it work? Why didn't the config suggested by the document work?

The tests were done first on Fedora 14 virtual machines (per node: 1 CPU core, 512 MB RAM, 10G disk, 1G drbd on a logical volume, and a physical volume on drbd forming a volume group named cluster).

Then I repeated them on real machines. These have more CPU cores (4), more RAM (4G), more disk (mirrored 750G), 180G drbd, and a guaranteed 100M routed link between the nodes, which are 5 hops apart.

By the way, how should one configure corosync to work on a routed multicast network? I had to create an openvpn tap link between the real nodes to make it work. The original config with public IPs did not work. Is corosync equipped to cope with multicast PIM messages? Or was it a firewall issue?
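
One setting that may matter on a routed path is the multicast TTL. Below is a sketch of the totem interface section, assuming a corosync version whose corosync.conf(5) documents the ttl option (whether the Fedora 14 package includes it is an assumption). All addresses, the port and the TTL value are placeholders, not values from the setup described above.

```
totem {
        version: 2
        interface {
                ringnumber: 0
                # placeholder: network address of the local interface
                bindnetaddr: 192.0.2.0
                # placeholder multicast group and port
                mcastaddr: 239.255.1.1
                mcastport: 5405
                # placeholder: must exceed the number of router hops;
                # the documented default of 1 does not cross a router
                ttl: 8
        }
}
```

A larger TTL alone may not be sufficient: the routers in between still have to forward the multicast group, and any firewall has to pass UDP on mcastport and mcastport - 1.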


Thanks in advance,
Bence

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
