On Sat, Feb 11, 2012 at 2:46 AM, Kiss Bence <be...@noc.elte.hu> wrote:
> Hi,
>
> On 01/30/2012 04:00 AM, Andrew Beekhof wrote:
>>
>> On Thu, Jan 26, 2012 at 2:08 AM, Kiss Bence <be...@noc.elte.hu> wrote:
>>>
>>> Hi,
>>>
>>> I am new to clustering and I am trying to build a two-node
>>> active/passive cluster based on the documentation:
>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>
>>> My systems are Fedora 14, up to date. After forming the cluster as
>>> described, I started to test it (resources: drbd -> lvm -> fs -> group
>>> of services). Resources were moved around, and nodes were rebooted and
>>> killed (first I tried it in a virtual environment, then also on real
>>> machines).
>>>
>>> After some of these events the two nodes ended up in a kind of
>>> split-brain state. crm_mon on both nodes showed the other node as
>>> offline, although the drbd subsystem showed everything in sync and
>>> working. The network was not the issue (ping, tcp and udp communication
>>> were fine); nothing had changed from the network's point of view.
>>>
>>> At first the rejoining went quite well, but after some more events it
>>> took longer, and after still more events it did not happen at all. The
>>> network dump showed the multicast packets still coming and going. In
>>> corosync (crm_node -l) the other node no longer appeared on either of
>>> them. After trying to configure the cib, the logs were full of messages
>>> like "<the other node>: not in our membership".
>>
>> That looks like a pacemaker bug.
>> Can you use crm_report to grab logs from about 30 minutes prior to the
>> first time you see this log until an hour after please?
>>
>> Attach that to a bug in bugs.clusterlabs.org and I'll take a look
>
> I had created a bug report: id 5031.

Perfect, I'll look there.
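In case it is useful for anyone else hitting this, a crm_report call along
the lines of the sketch below should capture that window. The timestamps and
the output name here are only placeholders; substitute the times that
bracket the first "not in our membership" message on the affected node.

  # Placeholder time window and report name -- adjust to the actual incident.
  # The last argument is the name of the tarball crm_report will create.
  crm_report -f "2012-02-10 12:00" -t "2012-02-10 14:00" membership-issue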
> The "split-brain" lasts about 5 minutes each time. Meanwhile the two
> nodes each think the other node is dead. However, drbd keeps working fine
> and properly prevents the second, rebooted node from going Primary.
> crm_node -l shows only the local node.
>
> Meanwhile one of my questions has been answered: the multicast problem was
> a local network issue. The local netadmin fixed it, and now it works.
>
> This issue seems similar to what James Flatten reported on the 8th of
> February ([Pacemaker] Question about cluster start-up in a 2 node cluster
> with a node offline.)
>
> The cluster is configured with:
>   stonith-enabled="false" \
>   no-quorum-policy="ignore"
>
> Thanks in advance,
> Bence
>
>>>
>>> I tried to erase the config (crm configure erase, cibadmin -E -f) but it
>>> only worked locally. I noticed that the pacemaker process did not start
>>> up normally on the node that booted after the other one. I also tried to
>>> remove the files from /var/lib/pengine/ and /var/lib/heartbeat/crm/, but
>>> only the resources were gone; it did not help in forming a cluster
>>> without resources. The pacemaker process exited some 20 minutes after it
>>> started, and starting it manually behaved the same way.
>>>
>>> After digging through google for answers I found nothing helpful. Based
>>> on some tips I changed the version in the /etc/corosync/service.d/pcmk
>>> file to 1.1 (this is the version of pacemaker in this distro). I realized
>>> that the cluster processes had been started by corosync itself, not by
>>> pacemaker, and this change avoids that. The cluster formation has been
>>> stable after this change, even after many, many events.
>>>
>>> Now I have reread the document mentioned above, and I wonder why it has
>>> the "Important notice" on page 37. What is wrong theoretically with my
>>> scenario?
>>
>> Having corosync start the daemons worked well for some but not others,
>> thus it was unreliable.
>> The notice points out a major difference between the two operating
>> modes so that people will not be caught by surprise when pacemaker
>> does not start.
>>
>>> Why does it work now? Why didn't the config suggested by the document
>>> work?
>>>
>>> Tests were done first on virtual machines running Fedora 14 (per node:
>>> 1 CPU core, 512Mb RAM, 10G disk, a 1G drbd device on a logical volume,
>>> and a physical volume on the drbd device forming a volume group named
>>> "cluster").
>>>
>>> Then on real machines. They have more cpu cores (4), more RAM (4G) and
>>> more disk (mirrored 750G), a 180G drbd device, and a 100M guaranteed
>>> routed link between the nodes, 5 hops away.
>>>
>>> By the way, how should one configure corosync to work on a multicast
>>> routed network? I had to create an openvpn tap link between the real
>>> nodes to get it working; the original config with public IPs did not
>>> work. Is corosync equipped to cope with multicast pim messages, or was
>>> it a firewall issue?
>>>
>>> Thanks in advance,
>>> Bence
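On the routed-multicast question: corosync's totem interface section has a
ttl option (default 1, in versions that support it), so by default multicast
packets will not survive the first router. A minimal sketch is below; the
bind network, multicast address and port are placeholders rather than values
from your setup, and the routers in between still have to forward the
multicast group (IGMP/PIM) and the firewalls have to pass the UDP traffic.
On corosync versions that offer it, the udpu (UDP unicast) transport is
often the simpler option across several hops.

  # /etc/corosync/corosync.conf -- sketch only, placeholder addresses
  totem {
          version: 2

          interface {
                  ringnumber: 0
                  bindnetaddr: 192.168.122.0
                  mcastaddr: 226.94.1.1
                  mcastport: 5405
                  # raise the multicast TTL so packets can cross the routed hops
                  ttl: 8
          }
  }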
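And on the /etc/corosync/service.d/pcmk change: as far as I know the "ver"
field is not meant to hold the pacemaker package version; the documented
values are 0 (corosync spawns the pacemaker daemons itself) and 1 (the
plugin is loaded but the daemons are started separately, e.g. by the
pacemaker init script). A sketch of the file for the second mode:

  # /etc/corosync/service.d/pcmk
  service {
          # Load the Pacemaker Cluster Resource Manager plugin, but let the
          # pacemaker init script start the daemons instead of corosync.
          name: pacemaker
          ver: 1
  }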
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org