Hi.  I have a simple two-node cluster which serves up a web
application in an active/passive configuration.  Each node is in a
separate data center.  The cluster is running Pacemaker and Corosync
with Red Hat Enterprise Linux 6.2.  I've experimented both with the
Corosync 1.4.1 that Red Hat ships with RHEL 6.2 and with a version
2.1 that I compiled myself.

More and more I'm liking the idea of having a third node as a quorum
'tie breaker'.  This would be a 'minimalist' node, unable to run the
application, used only in the quorum membership calculations.  The
third node would sit in the same network as the actual users'
desktops, so connectivity to the third node would signify successful
delivery of the application from the cluster to the users.
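
The appeal is just majority arithmetic - with two votes a tie is
unbreakable, with three it isn't.  A trivial sketch in plain Python
(nothing Corosync-specific, just the voting logic as I understand it):

```python
# Majority quorum: a partition is quorate only if it holds more than
# half of the expected votes.
def has_quorum(partition_votes, expected_votes):
    return partition_votes > expected_votes // 2

# Two-node cluster: after a split, neither lone node has a majority.
assert not has_quorum(1, 2)

# Add C as a third vote: any two-node partition wins, a lone node
# loses, so a split between A and B is no longer a tie.
assert has_quorum(2, 3)
assert not has_quorum(1, 3)
```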

I'm using unicast UDP for my cluster connectivity with Corosync.
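
For context, my totem transport section looks roughly like this
(addresses taken from the logs below; treat it as a sketch rather
than my exact file - this is the RHEL 6 / Corosync 1.x syntax, while
Corosync 2.x uses a nodelist instead):

```
totem {
        version: 2
        transport: udpu
        interface {
                ringnumber: 0
                bindnetaddr: 10.198.156.0
                mcastport: 5405
                member {
                        memberaddr: 10.198.156.49
                }
                member {
                        memberaddr: 10.198.156.46
                }
                member {
                        memberaddr: 10.198.156.47
                }
        }
}
```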

If the two application ('real') nodes are A & B, and the 'tie breaker'
C, I was wondering what would happen in this scenario:

.  link A-B is down
.  link A-C is up
.  link B-C is up

In such a situation I thought that, hopefully, all the nodes would
still 'see' each other, all three in the one cluster partition, with C
relaying knowledge of A & B to each other.  That, I assumed, is why
Corosync calls its circuits 'rings'; presumably nodes don't need
direct links to every other node?
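
To make the two possibilities concrete, here's a toy model (plain
Python, no Corosync involved; 'reachability' and 'full mesh' are my
labels, not Corosync terms) of how the three nodes would partition
under each membership rule:

```python
# Toy model of the broken-link scenario: A-B down, A-C and B-C up.
links = {frozenset("AC"), frozenset("BC")}
nodes = ["A", "B", "C"]

def connected(a, b):
    """True if a direct link exists between nodes a and b."""
    return frozenset((a, b)) in links

def reachability_partitions():
    """Rule 1: indirect reachability suffices (C relays A & B).
    Merge every existing group the new node can reach directly."""
    comps = []
    for n in nodes:
        touching = [c for c in comps if any(connected(n, m) for m in c)]
        merged = {n}
        for c in touching:
            merged |= c
            comps.remove(c)
        comps.append(merged)
    return comps

def full_mesh_partitions():
    """Rule 2: a node may only join a group in which it has a direct
    link to *every* member - so C must choose one of A or B."""
    comps = []
    for n in nodes:
        for c in comps:
            if all(connected(n, m) for m in c):
                c.add(n)
                break
        else:
            comps.append({n})
    return comps

# Under rule 1 everyone lands in a single partition ...
assert reachability_partitions() == [{"A", "B", "C"}]

# ... under rule 2 one 'real' node is always left on its own, and
# which pairing forms depends purely on processing order.
parts = full_mesh_partitions()
assert len(parts) == 2
assert {"A", "C"} in parts or {"B", "C"} in parts
```

What I actually observed (below: B & C pairing up under 1.4.1, A & C
under 2.1) looks a lot like rule 2 with timing deciding the pairing -
but I don't know whether that's really what Corosync implements.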

But I seem to have discovered a bug with Corosync 1.4.1.  With
Corosync down on all three nodes and the link between A & B 'broken'
(with iptables rules on B rejecting traffic from A) I do the following:

-  Start Corosync on A.  A shows itself as the sole member of its ring
('corosync-objctl | fgrep member'); the cluster log shows the normal
Corosync messages and then silence.

-  Start Corosync on B.  Both A and B think they are the sole members
of their rings.  Log activity is normal; a burst of messages at the
start of Corosync and then silence.

-  Start Corosync on C.  Weird behavior ensues:

a)  A still believes it's the only member of its ring, but every 4
seconds it outputs a burst of messages:

Jan 23 11:56:48 nodeA corosync[5344]:   [pcmk  ] notice:
pcmk_peer_update: Transitional membership event on ring 24812: memb=1,
new=0, lost=0
Jan 23 11:56:48 nodeA corosync[5344]:   [pcmk  ] info:
pcmk_peer_update: memb: nodeA 832357898
Jan 23 11:56:48 nodeA corosync[5344]:   [pcmk  ] notice:
pcmk_peer_update: Stable membership event on ring 24812: memb=1,
new=0, lost=0
Jan 23 11:56:48 nodeA corosync[5344]:   [pcmk  ] info:
pcmk_peer_update: MEMB: nodeA 832357898
Jan 23 11:56:48 nodeA corosync[5344]:   [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Jan 23 11:56:48 nodeA corosync[5344]:   [CPG   ] chosen downlist:
sender r(0) ip(10.198.156.49) ; members(old:1 left:0)
Jan 23 11:56:48 nodeA corosync[5344]:   [MAIN  ] Completed service
synchronization, ready to provide service.

The ring numbers increase in leaps with each 4-second burst - ring
24452, 24544, 24632, 24724.

b)  Nodes B & C each show a membership list of those two nodes - B &
C - in their local rings, but emit a huge flurry of almost-continuous
messages ... 40 or 50 'sets' like this on nodeB:

Jan 23 11:56:33 nodeB corosync[9397]:   [CPG   ] chosen downlist:
sender r(0) ip(10.198.156.46) ; members(old:2 left:0)
Jan 23 11:56:33 nodeB corosync[9397]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Jan 23 11:56:33 nodeB corosync[9397]:   [pcmk  ] notice:
pcmk_peer_update: Transitional membership event on ring 24384: memb=2,
new=0, lost=0
Jan 23 11:56:33 nodeB corosync[9397]:   [pcmk  ] info:
pcmk_peer_update: memb: nodeC 782026250
Jan 23 11:56:33 nodeB corosync[9397]:   [pcmk  ] info:
pcmk_peer_update: memb: nodeB 1687995914
Jan 23 11:56:33 nodeB corosync[9397]:   [pcmk  ] notice:
pcmk_peer_update: Stable membership event on ring 24384: memb=2,
new=0, lost=0
Jan 23 11:56:33 nodeB corosync[9397]:   [pcmk  ] info:
pcmk_peer_update: MEMB: nodeC 782026250
Jan 23 11:56:33 nodeB corosync[9397]:   [pcmk  ] info:
pcmk_peer_update: MEMB: nodeB 1687995914
Jan 23 11:56:33 nodeB corosync[9397]:   [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Jan 23 11:56:33 nodeB corosync[9397]:   [CPG   ] chosen downlist:
sender r(0) ip(10.198.156.46) ; members(old:2 left:0)
Jan 23 11:56:33 nodeB corosync[9397]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Jan 23 11:56:33 nodeB corosync[9397]:   [pcmk  ] notice:
pcmk_peer_update: Transitional membership event on ring 24388: memb=2,
new=0, lost=0
Jan 23 11:56:33 nodeB corosync[9397]:   [pcmk  ] info:
pcmk_peer_update: memb: nodeC 782026250
Jan 23 11:56:33 nodeB corosync[9397]:   [pcmk  ] info:
pcmk_peer_update: memb: nodeB 1687995914
Jan 23 11:56:33 nodeB corosync[9397]:   [pcmk  ] notice:
pcmk_peer_update: Stable membership event on ring 24388: memb=2,
new=0, lost=0
Jan 23 11:56:33 nodeB corosync[9397]:   [pcmk  ] info:
pcmk_peer_update: MEMB: nodeC 782026250
Jan 23 11:56:33 nodeB corosync[9397]:   [pcmk  ] info:
pcmk_peer_update: MEMB: nodeB 1687995914
Jan 23 11:56:33 nodeB corosync[9397]:   [TOTEM ] A processor joined or
left the membership and a new membership was formed.

The excerpt above shows two 'sets', each starting with the 'chosen
downlist' line and ending with 'A processor joined or left'.

And then finishing with some 'flushing' lines:

Jan 23 11:56:34 nodeB corosync[9397]:   [TOTEM ] Process pause
detected for 1165 ms, flushing membership messages.
Jan 23 11:56:34 nodeB corosync[9397]:   [TOTEM ] Process pause
detected for 1165 ms, flushing membership messages.
Jan 23 11:56:34 nodeB corosync[9397]:   [TOTEM ] Process pause
detected for 1224 ms, flushing membership messages.
Jan 23 11:56:34 nodeB corosync[9397]:   [TOTEM ] Process pause
detected for 1284 ms, flushing membership messages.
Jan 23 11:56:34 nodeB corosync[9397]:   [TOTEM ] Process pause
detected for 1343 ms, flushing membership messages.

Both B & C are doing this.  Corosync doesn't seem well at all.

Which is disappointing.  I'd hoped that the 'ring' would keep all
three members in a single cluster partition.  But even if it didn't,
Corosync's continuous streaming of messages for the two nodes of the
two-node partition seems to be broken behavior.

I repeated this experiment on another three virtual machines, running
the version 2.1 of Corosync that I'd compiled, and got different
results.  In this case (using corosync-quorumtool to report on
ring/quorum membership - is that the right tool?), starting Corosync
in the sequence A, B, then C, each node saw only itself in its
ring/quorum.  A & B output a set of three lines in the log every 1-2
seconds:

Jan 23 12:14:18 nodeA corosync[20068]:   [QUORUM] Members[1]: 1
Jan 23 12:14:18 nodeA corosync[20068]:   [TOTEM ] A processor joined
or left the membership and a new membership (10.198.156.47:6388) was
formed.
Jan 23 12:14:18 nodeA corosync[20068]:   [MAIN  ] Completed service
synchronization, ready to provide service.

But node C showed the 'runaway' behavior that both B & C had displayed
with the older version of Corosync; not as verbose, but up to ten
3-line sets like the above, then a number of lines like this:

Jan 23 12:14:46 nodeC corosync[2430]:   [TOTEM ] Process pause
detected for 600 ms, flushing membership messages.
Jan 23 12:14:46 nodeC corosync[2430]:   [TOTEM ] Process pause
detected for 600 ms, flushing membership messages.
Jan 23 12:14:46 nodeC corosync[2430]:   [TOTEM ] Process pause
detected for 650 ms, flushing membership messages.
Jan 23 12:14:46 nodeC corosync[2430]:   [TOTEM ] Process pause
detected for 700 ms, flushing membership messages.
Jan 23 12:14:46 nodeC corosync[2430]:   [TOTEM ] Process pause
detected for 700 ms, flushing membership messages.
Jan 23 12:14:46 nodeC corosync[2430]:   [TOTEM ] Process pause
detected for 700 ms, flushing membership messages.

After a few minutes - while writing this :) - A & C joined up in a
single partition, with both of them then producing the near-continuous
'flurry' of messages that I described for just C above.  So Corosync
2.1 ended up doing much the same thing as 1.4.1.

Both clusters stabilize beautifully, forming a single partition of all
three nodes, as soon as I remove the iptables rules that 'broke' the
link between A & B.
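
For the record, the 'break' is along these lines on nodeB
(illustrative - 10.198.156.49 is nodeA's address from the logs
above):

```shell
# On nodeB: reject traffic from nodeA, simulating a dead A-B link.
iptables -A INPUT -s 10.198.156.49 -j REJECT

# Remove the rule again to 'repair' the link.
iptables -D INPUT -s 10.198.156.49 -j REJECT
```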

I'm at a loss as to what behavior I should expect.  I'd hoped that
Corosync's 'rings' would keep nodes in the same partition even if
there weren't direct links between them.  If not, I was wondering how
it decides which nodes end up in which partition - if A & B can both
connect to C, which one does C 'pick'?

But it looks like Corosync can't handle the situation at all?  Was
Corosync designed to expect direct connectivity/links between all
nodes in a cluster?

What do clusters do when there are partial network failures - links
breaking between one pair of nodes, but not between all of them?  Is
this the secret of STONITH's widespread acceptance ... nodes are
killed so that Corosync won't misbehave on them?  :)

I would really appreciate any pointers on this ... I'm not sure how I
should set up my cluster if I don't understand what I can expect
Corosync to do under such circumstances.

Thank you very much!
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems