Hi Kostya,
I'm having a little trouble understanding your question, sorry.
On boot, the node will not start anything, so after booting it, you
log in, check that it can talk to the peer node (a simple ping is
generally enough), then start the cluster. It will join the peer's
existing cluster (even if it's a cluster on just itself).
If you boot both nodes, say after a power outage, you check the
connection (again, a simple ping is fine) and then start the cluster
on both nodes at the same time.
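The start-up routine described above is a manual procedure, but its logic fits in a few lines. This is a minimal sketch, not a real tool. `start_after_boot`, `peer_reachable` and `start_cluster` are hypothetical names. They stand in for a ping and for whatever command starts corosync and pacemaker on your distribution.

```python
from typing import Callable, Sequence


def start_after_boot(
    nodes: Sequence[str],
    peer_reachable: Callable[[str], bool],
    start_cluster: Callable[[str], None],
) -> bool:
    """Start the cluster on freshly booted nodes only if the peer link works.

    `nodes` holds the node(s) you just booted: one node rejoining an existing
    cluster, or both nodes after, say, a power outage.
    Returns True if the cluster was started, False if it was left stopped.
    Hypothetical helper: the callables are injected so no real command runs.
    """
    # Check the link first (a simple ping is usually enough); checking from
    # every booted node also catches a one-sided network fault.
    if not all(peer_reachable(node) for node in nodes):
        return False  # leave the cluster down and investigate by hand
    # Start the cluster on every booted node at (roughly) the same time.
    for node in nodes:
        start_cluster(node)
    return True
```

Injecting the two callables keeps the sketch testable. The point it illustrates is only that nothing starts until the link to the peer has been checked.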
If one of the nodes needs to be shut down, say for repairs or
upgrades, you migrate the services off of it and over to the peer node,
then you stop the cluster (which tells the peer that the node is leaving
the cluster). After that, the remaining node operates by itself. When
you turn the first node back on, you rejoin it to the cluster and
migrate the services back.
I think, maybe, you are making things more complicated than they need
to be. Pacemaker and corosync will handle most of this for you, once
set up properly. What operating system do you plan to use, and what
cluster stack? I suspect it will be corosync + pacemaker, which should
work fine.
digimer
On 23/06/14 10:36 AM, Kostiantyn Ponomarenko wrote:
Hi Digimer,
Suppose I disable the cluster on start-up; but what about the remaining
node, if I need to reboot it?
Even if the connection between these two nodes is lost, I need to have
one node working and providing resources.
How did you solve this situation?
Should there be a separate daemon which somehow checks the connection
between the two nodes and decides whether to start corosync and
pacemaker or to keep them down?
Thank you,
Kostya
On Mon, Jun 23, 2014 at 4:34 PM, Digimer <li...@alteeve.ca> wrote:
On 23/06/14 09:11 AM, Kostiantyn Ponomarenko wrote:
Hi guys,
I want to gather all possible configuration variants for a 2-node
cluster, because it has a lot of pitfalls and there is not much
information about it on the internet. I also have some questions
about these configurations and their specific problems.
VARIANT 1:
-----------------
We can use the "two_node" and "wait_for_all" options from Corosync's
votequorum, and set up the fencing agents with a delay on one of them.
Here is a workflow (diagram) of this configuration:
1. Node starts.
2. Cluster starts (Corosync and Pacemaker) at boot time.
3. Wait for all nodes. Have all nodes joined?
   No. Go to step 3.
   Yes. Go to step 4.
4. Start resources.
5. Split-brain situation (a problem with the connection between the
   nodes).
6. The fencing agent on one of the nodes reboots the other node (there
   is a configured delay on one of the fencing agents).
7. The rebooted node goes to step 1.
There are two (or more?) important things in this configuration:
1. The rebooted node keeps waiting for all nodes to be visible (the
   connection has to be restored first).
2. Suppose the connection problem still exists and the node which
   rebooted the other one also has to be rebooted (for some reason).
   After the reboot, it is also stuck at step 3 because of the
   connection problem.
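A toy model of the two votequorum options shows why the rebooted node stays blocked. This is a sketch, not corosync code. The `VotequorumNode` class and its methods are hypothetical. The model encodes only the behaviour described above. With `wait_for_all`, a node that (re)starts becomes quorate only after it has seen all nodes at the same time. With `two_node`, a node that has already done so keeps quorum on its own.

```python
from dataclasses import dataclass


@dataclass
class VotequorumNode:
    """Toy model of one node's quorum state with two_node + wait_for_all.

    Hypothetical class: it mimics the option semantics, not corosync internals.
    """

    name: str
    running: bool = False
    seen_all_once: bool = False  # wait_for_all: were all nodes visible together?

    def start(self) -> None:
        # A (re)started node has to wait for all nodes again.
        self.running = True
        self.seen_all_once = False

    def update_membership(self, visible_nodes: int, total_nodes: int = 2) -> None:
        # Record that every node was visible at the same time.
        if self.running and visible_nodes == total_nodes:
            self.seen_all_once = True

    def quorate(self) -> bool:
        # With two_node a single vote is enough, but only once
        # wait_for_all has been satisfied since this node started.
        return self.running and self.seen_all_once
```

Suppose both nodes start and see each other, and node2 is then fenced and rebooted while the link is still down. In the model, node1 stays quorate and node2 waits. If node1 is also rebooted before the link comes back, neither node is quorate, which is the situation in the second point.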
QUESTION:
-----------------
Is it possible to somehow give the node that won the fence race (the
one that rebooted the other node) a "primary" status that allows it
not to wait for all nodes after a reboot, and to drop this status once
the other node has joined?
So, is it possible?
Right now that's the only configuration I know for a 2-node cluster.
Other variants are very welcome =)
VARIANT 2 (not implemented, just a suggestion):
-----------------
I've been thinking about using an external SSD drive (or another
external drive). For example, the fencing agent could reserve the SSD
using a SCSI command and then reboot the other node.
The main idea is that the first node, as soon as the cluster starts on
it, reserves the SSD until the other node joins the cluster; after
that, the SCSI reservation is removed.
1. Node starts.
2. Cluster starts (Corosync and Pacemaker) at boot time.
3. Reserve the SSD. Did the reservation succeed?
   No. Don't start resources (wait for all).
   Yes. Go to step 4.
4. Start resources.
5. Remove the SCSI reservation when the other node has joined.
6. Split-brain situation (a problem with the connection between the
   nodes).
7. The fencing agent tries to reserve the SSD. Did the reservation
   succeed?
   No. Maybe put the node in standby mode ...
   Yes. Reboot the other node.
8. Optional: a single node can keep the SSD reservation while it is
   alone in the cluster or until it shuts down.
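The reservation logic of this proposal can be sketched as a toy model. All names here (`SharedDisk`, `on_cluster_start`, `on_peer_joined`, `on_split_brain`) are hypothetical. The in-memory `holder` field stands in for a real SCSI reservation on the external drive. The sketch shows the decision flow only, not how to issue SCSI commands.

```python
from typing import Optional


class SharedDisk:
    """Stand-in for an external drive that supports an exclusive reservation.

    Hypothetical: a real setup would issue SCSI reservation commands instead.
    """

    def __init__(self) -> None:
        self.holder: Optional[str] = None

    def reserve(self, node: str) -> bool:
        # Succeeds if the disk is free or already reserved by this node.
        if self.holder in (None, node):
            self.holder = node
            return True
        return False

    def release(self, node: str) -> None:
        # Only the current holder can drop the reservation.
        if self.holder == node:
            self.holder = None


def on_cluster_start(disk: SharedDisk, node: str) -> str:
    # The first node up reserves the disk; a node that cannot reserve it waits for all.
    return "start resources" if disk.reserve(node) else "wait for all"


def on_peer_joined(disk: SharedDisk, node: str) -> None:
    # Both nodes are members now, so the reservation is dropped.
    disk.release(node)


def on_split_brain(disk: SharedDisk, node: str) -> str:
    # Both nodes race for the disk; only the winner fences its peer.
    return "reboot peer" if disk.reserve(node) else "standby"
```

Because `reserve` also succeeds for the current holder, a node that wins the split-brain race simply keeps the reservation while it is alone, which matches the optional last step.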
I am really looking forward to finding the best solution (or a couple
of them =)).
I hope I am not the only person who is interested in this topic.
Thank you,
Kostya
Hi Kostya,
I only build 2-node clusters, and I've not had problems with this
going back to 2009 over dozens of clusters. The tricks I found are:
* Disable quorum (of course)
* Set up good fencing, and add a delay to the node you prefer (or
pick one at random, if they are of equal value) to avoid dual fences
* Disable the cluster on start-up, to prevent fence loops.
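A toy timing model illustrates why a delay on one side prevents dual fences. It is an assumption-laden sketch, not pacemaker behaviour. `fence_race`, its parameters and the one-second default `fence_duration` are made up for illustration. Each node starts fencing its peer when it notices the loss. A shot aimed at the delayed node fires only after that node's delay.

```python
from typing import Dict, Optional


def fence_race(
    fence_delay: Dict[str, float],
    detect_time: Dict[str, float],
    fence_duration: float = 1.0,
) -> Optional[str]:
    """Return the surviving node of a two-node fence race, or None for a dual fence.

    Hypothetical model; expects exactly two node names as keys.
    fence_delay[n]: extra wait applied before node n is fenced (0 if none).
    detect_time[n]: when node n notices the peer is gone and starts fencing.
    fence_duration: how long a fence action takes to actually kill its target.
    """
    a, b = sorted(fence_delay)
    a_fires = detect_time[a] + fence_delay[b]  # a starts fencing b
    b_fires = detect_time[b] + fence_delay[a]  # b starts fencing a
    # If the second shot starts before the first one has landed, both die.
    if abs(a_fires - b_fires) < fence_duration:
        return None
    return a if a_fires < b_fires else b
```

Suppose there is no delay and both nodes react within a fraction of a second. The model then reports a dual fence (`None`). With a hypothetical 15-second delay on fencing node1, node1's shot lands long before node2's, so node1 survives.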
That's it. With this, your 2-node cluster will be just fine.
As for your question: once a node is fenced successfully, the
resource manager (pacemaker) will take over any services lost on the
fenced node, if that is how you configured it. A node that either
gracefully leaves or dies/is fenced should not interfere with the
remaining node.
The problem is when a node vanishes and fencing fails. Then, not
knowing what the other node might be doing, the only safe option is
to block, otherwise you risk a split-brain. This is why fencing is
so important.
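The rule in the previous paragraph is small enough to write as a decision function. This is a minimal sketch. `on_peer_lost` is a hypothetical name, and `fenced` stands for a confirmed, successful fence of the peer. Pacemaker makes this decision itself; the code only spells out the logic.

```python
from typing import List, Tuple


def on_peer_lost(fenced: bool, lost_services: List[str]) -> Tuple[str, List[str]]:
    """Decide what the surviving node does after its peer vanishes.

    Hypothetical helper. Returns the action and the services it recovers.
    """
    if fenced:
        # The peer is known to be off, so its services can safely be recovered here.
        return "recover", list(lost_services)
    # Fencing failed: the peer may still be running them, so block to avoid split-brain.
    return "block", []
```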
Cheers
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person
without access to education?
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org