Hi Andrew,
did you have time to look at the bug report? Is there anything
missing from it?
Another question, which may help me understand the problem better:
if a node fails, what is Pacemaker's expected behaviour when recovering
that node, with STONITH enabled and without it? What is expected from
the sysadmin in this procedure?
Bence
On 02/13/2012 02:09 AM, Andrew Beekhof wrote:
On Sat, Feb 11, 2012 at 2:46 AM, Kiss Bence<be...@noc.elte.hu> wrote:
Hi,
On 01/30/2012 04:00 AM, Andrew Beekhof wrote:
On Thu, Jan 26, 2012 at 2:08 AM, Kiss Bence<be...@noc.elte.hu> wrote:
Hi,
I am a newbie to clustering and I am trying to build a two-node
active/passive cluster based on the documentation:
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
My systems are Fedora 14, up to date. After forming the cluster as described,
I started to test it (resources: drbd -> lvm -> fs -> group of services).
Resources moved around, nodes were rebooted and killed (first I tried it in a
virtual environment, then also on real machines).
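For reference, a minimal sketch of that stack in crm shell syntax would look
something like the following (the resource names, device paths and filesystem
are illustrative placeholders rather than my exact configuration):

    primitive p_drbd ocf:linbit:drbd \
        params drbd_resource="data" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
    ms ms_drbd p_drbd \
        meta master-max="1" master-node-max="1" \
             clone-max="2" clone-node-max="1" notify="true"
    primitive p_lvm ocf:heartbeat:LVM params volgrpname="cluster"
    primitive p_fs ocf:heartbeat:Filesystem \
        params device="/dev/cluster/data" directory="/srv/data" fstype="ext4"
    group g_services p_lvm p_fs
    colocation c_services_on_master inf: g_services ms_drbd:Master
    order o_drbd_before_services inf: ms_drbd:promote g_services:start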
After some events the two nodes ended up in a kind of split-brain state.
crm_mon showed the other node as offline on both nodes, although the drbd
subsystem showed everything in sync and working. The network was not the
issue (ping, tcp and udp communications were fine); nothing changed from the
network's point of view.
At first the rejoining took place quite well, but after some more events it
took longer, and after still more events it did not happen at all. The
network dump showed the multicast packets still coming and going. In corosync
(crm_node -l) the other node did not appear on either of them. After trying
to configure the cib, the logs were full of messages like
"<the other node>: not in our membership".
That looks like a pacemaker bug.
Can you use crm_report to grab logs from about 30 minutes before the
first time you see this log until an hour after, please?
Attach that to a bug at bugs.clusterlabs.org and I'll take a look.
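Something along these lines should do it (a sketch only; adjust the
timestamps to match your logs, the destination name is arbitrary, and the
exact options can differ a little between versions):

    crm_report -f "2012-02-06 12:00" -t "2012-02-06 14:00" /tmp/membership-issue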
I have created a bug report: id 5031.
Perfect, I'll look there.
The "split-brain" is lasting every time about 5 minutes. Meanwhile the two
nodes think that the other node is dead. However the drbd is working fine,
and properly disallowing the second rebooted node to go Primary. The
crm_node -l shows only the local node.
Meanwhile one of my question is answered. The multicast issue was a local
network issue. The local netadmin fixed it. Now it works.
This issue seems to me similar to what James Flatten reported on 8 February
([Pacemaker] Question about cluster start-up in a 2 node cluster with a node
offline). My cluster options are stonith-enabled="false" and
no-quorum-policy="ignore".
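For reference, those options are set with the crm shell along these lines:

    crm configure property stonith-enabled=false
    crm configure property no-quorum-policy=ignore
    crm configure show      # verify the resulting configuration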
Thanks in advance,
Bence
I tried to erase the config (crm configure erase, cibadmin -E -f), but it
worked only locally. I noticed that the pacemaker process did not start up
normally on the node that booted after the other one. I also tried to remove
the files from /var/lib/pengine/ and /var/lib/heartbeat/crm/, but only the
resources were gone; it did not help in forming a cluster without resources.
The pacemaker process exited some 20 minutes after it started. Starting it
manually gave the same result.
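For what it's worth, the kind of cluster-wide reset I was attempting
corresponds roughly to this sketch (assuming the stock init scripts and the
file locations above; to be run on every node):

    service corosync stop                 # stops pacemaker too when corosync spawns it
    rm -f /var/lib/pengine/*              # old policy engine inputs
    rm -f /var/lib/heartbeat/crm/cib*     # the CIB and its signatures/backups
    service corosync start
    # Alternatively, with the whole cluster up, erase the live configuration:
    crm configure erase
    cibadmin --erase --force              # long form of "cibadmin -E -f"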
After digging through Google for answers, I found nothing helpful. Following
some tips, I changed the version in the /etc/corosync/service.d/pcmk file to
1.1 (this is the version of pacemaker in this distro). I realized that the
cluster processes were being started by corosync itself, not by pacemaker,
and that this could be avoided. The cluster formation has been stable after
this change, even after many, many events.
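For context, the stanza I mean is roughly the following; as I understand it,
ver: 0 makes corosync spawn the Pacemaker daemons itself, while any other
value (1, or the 1.1 I used) means pacemaker has to be started separately by
its own init script:

    # /etc/corosync/service.d/pcmk
    service {
        # Load the Pacemaker Cluster Resource Manager
        name: pacemaker
        ver: 1
    }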
Now I have reread the document mentioned above, and I wonder why it gives the
"Important notice" on page 37. What is theoretically wrong with my scenario?
Having corosync start the daemons worked well for some but not others,
thus it was unreliable.
The notice points out a major difference between the two operating
modes so that people will not be caught by surprise when pacemaker
does not start.
Why does it work now? Why didn't the config suggested by the document work?
Tests were done first on virtual machines running Fedora 14 (1 CPU core,
512 MB RAM, 10 GB disk, 1 GB drbd on a logical volume, physical volume on
drbd forming a volume group named "cluster") per node.
Then on real machines: they have more CPU cores (4), more RAM (4 GB) and more
disk (mirrored 750 GB), a 180 GB drbd, and a 100 Mbit/s guaranteed routed
link between the nodes, 5 hops away.
By the way, how should one configure corosync to work on a multicast-routed
network? I had to create an openvpn tap link between the real nodes to make
it work; the original config with public IPs did not work. Is corosync
equipped to cope with multicast PIM messages, or was it a firewall issue?
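To be concrete, the part of corosync.conf I mean is the totem interface
section, roughly like this (the addresses are placeholders; as far as I know,
multicast across routers also needs a TTL above the default of 1 and PIM
routing enabled along the path):

    totem {
        version: 2
        interface {
            ringnumber: 0
            bindnetaddr: 192.0.2.0      # network address of the local interface
            mcastaddr: 239.255.1.1      # multicast group used by the cluster
            mcastport: 5405
            ttl: 8                      # >1 so routers may forward the packets;
                                        # only available in newer corosync releases
        }
    }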
Thanks in advance,
Bence
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org