Re: [Pacemaker] Creating a safe cluster-node shutdown script (for when UPS goes OnBattery+LowBattery)

Tomas Jelinek Thu, 10 Jul 2014 01:31:30 -0700

On 10.7.2014 03:17, Giuseppe Ragusa wrote:

On Thu, Jul 10, 2014, at 00:06, Andrew Beekhof wrote:


On 9 Jul 2014, at 10:28 pm, Giuseppe Ragusa <giuseppe.rag...@hotmail.com> wrote:

On Tue, Jul 8, 2014, at 02:59, Andrew Beekhof wrote:


On 4 Jul 2014, at 3:16 pm, Giuseppe Ragusa <giuseppe.rag...@hotmail.com> wrote:

Hi all,
I'm trying to create a script as per subject (on CentOS 6.5, CMAN+Pacemaker, 
only DRBD+KVM active/passive resources; SNMP-UPS monitored by NUT).

Ideally I think that each node should stop (disable) all locally-running 
VirtualDomain resources (doing so cleanly demotes then downs the DRBD resources 
underneath), then put itself in standby and finally shut down.


Since the end goal is shutdown, why not just run 'pcs cluster stop' ?


I thought that this action would cause a communication interruption (since 
Corosync would not be responding to the peer) and so cause the other node to 
stonith us;


No. Shutdown is a globally co-ordinated process.
We don't fence nodes we know shut down cleanly.


Thanks for the clarification.
Now that you said it, it seems also logical and even obvious ;>

I know that ideally the other node too should perform "pcs cluster stop" shortly afterwards, since 
the same UPS powers both, but I worry about timing issues (and "races") in UPS monitoring 
since it is a large Enterprise UPS monitored by SNMP.

Furthermore I do not know what happens to running resources at "pcs cluster 
stop": I infer from your suggestion that resources are brought down and not migrated 
to the other node, correct?


If the other node is shutting down too, they'll simply be stopped.
Otherwise we'll try to move them.


It's the "moving" that worries me :)

Possibly with 'pcs cluster standby' first if you're worried that stopping the 
resources might take too long.


I forgot to ask: in which way would a previous standby make the resources stop 
sooner?

I thought that "pcs cluster standby" would usually migrate the resources to the 
other node (I actually tried it and confirmed the expected behaviour); so this would risk 
becoming a race with the timing of the other node's standby,


Not really, at the point the second node runs 'standby' we'll stop trying to 
migrate services and just stop them everywhere.
Again, this is a centrally controlled process, timing isn't a problem.


I understand that, "eventually", timing won't be a problem and resources will 
"eventually" stop, but from your description I'm afraid that the overall shutdown 
process could be delayed by possibly unsynchronized UPS notifications on the nodes 
(first node starts standby, resources start to move, THEN second node starts standby).

So now I'm taking your advice and I'll modify the script to use cluster stop 
but, with the aim of avoiding the aforementioned delay (if it actually 
represents a possibility), I would like to ask you three questions:

Hi,

"pcs cluster stop --all" does not work on 6.5 with pcs-0.9.90, which I believe 
is shipped with 6.5. You need to install the current pcs version from 
https://github.com/feist/pcs and run the pcsd service on all nodes.

*) if I simply issue a "pcs cluster stop --all" from the first node that gets 
notified of UPS critical status, do I risk any adverse effect when the other node 
asynchronously gives the same command some time later (before/after the whole cluster 
stop sequence completes)?

It just runs "service pacemaker stop" and "service cman stop" on every node. 
It should have no effect once the services are already stopped.


*) does the aforementioned "pcs cluster stop --all" command return only after the cluster 
stop sequence has actually/completely ended (so as to safely issue a "shutdown -h now" 
immediately afterwards)?

Yes. You need to check the return code: a non-zero return code means some 
error has occurred and some nodes haven't been stopped. When you shut down one 
node and then try to run "pcs cluster stop --all" on the other one, it will fail 
and return a non-zero return code (obviously).
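
A minimal sketch of the return-code check described above. The function name 
stop_cluster_and_halt and the HALT_CMD variable are hypothetical names used only 
for illustration, and the fallback to a local-only "pcs cluster stop" when the 
cluster-wide stop fails is an assumption, not something confirmed in this thread.

```shell
#!/bin/bash
# Sketch only: stop_cluster_and_halt and HALT_CMD are hypothetical names,
# not part of pcs or of the script posted in this thread.
HALT_CMD="${HALT_CMD:-/sbin/shutdown -h +0}"

stop_cluster_and_halt() {
    # A non-zero exit status means at least one node was not stopped
    # (for example because the peer has already been shut down).
    if pcs cluster stop --all; then
        echo "cluster stopped on all nodes"
    else
        echo "pcs cluster stop --all failed with status $?" >&2
        # Assumption: fall back to stopping the cluster on this node only.
        pcs cluster stop || echo "local cluster stop failed too" >&2
    fi
    # Battery is running out, so halt regardless of the outcome above.
    # HALT_CMD is left unquoted on purpose so its arguments are split.
    ${HALT_CMD}
}
```

The function is only defined here; a NUT shutdown hook would call 
stop_cluster_and_halt once the UPS reports OnBattery+LowBattery.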


*) is the "pcs cluster stop --all" command known to work reliably on current CentOS 6.5? 
(I ask since I found some discussion around "pcs cluster start" related bugs)

See above.

Regards,
Tomas


Many thanks again for your invaluable help and insight.

Regards,
Giuseppe

so this is why I took the hassle of explicitly and orderly stopping all 
locally-running resources in my script BEFORE putting the local node in standby.

Pacemaker will stop everything in the required order and stop the node when 
done... problem solved?


I thought that after a "pcs cluster standby" a regular "shutdown -h" of the 
operating system would cleanly bring down the cluster too,


It should do

without the need for a "pcs cluster stop", given that both Pacemaker and CMAN 
are correctly configured for automatic startup/shutdown as operating system services 
(SysV initscripts controlled by CentOS 6.5 Upstart, in my case).

Many thanks again for your always thought-provoking and informative answers!

Regards,
Giuseppe


On further startup, manual intervention would be required to unstandby all 
nodes and enable resources (nodes already in standby and resources already 
disabled before blackout should be manually distinguished).
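
One way to make that distinction less manual would be to record only the 
resources that the shutdown script itself disables, and re-enable exactly those 
after power returns. A minimal sketch follows; the STATE_FILE path and both 
function names are hypothetical, and node standby state would need similar 
bookkeeping that is not shown here.

```shell
#!/bin/bash
# Sketch only: STATE_FILE and both function names are hypothetical.
STATE_FILE="${STATE_FILE:-/var/lib/ups-shutdown/disabled-resources}"

disable_and_record() {
    # $@: resources to disable. Only resources successfully disabled by
    # this call are recorded, so ones disabled earlier by an admin are not.
    local res
    mkdir -p "$(dirname "${STATE_FILE}")"
    for res in "$@"; do
        pcs resource disable "${res}" && echo "${res}" >> "${STATE_FILE}"
    done
}

restore_recorded() {
    # Re-enable exactly the resources recorded by disable_and_record.
    local res
    [ -f "${STATE_FILE}" ] || return 0
    while read -r res; do
        # Keep pcs from consuming the loop's stdin.
        pcs resource enable "${res}" < /dev/null
    done < "${STATE_FILE}"
    rm -f "${STATE_FILE}"
}
```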

Is this strategy conceptually safe?

Unfortunately, various searches have turned up no "prior art" :)

This is my tentative script (consider it in the public domain):

------------------------------------------------------------------------------------------------------------------------------------
#!/bin/bash

# Note: "pcs cluster status" still has a small bug vs. CMAN-controlled Corosync and would always return != 0
pcs status > /dev/null 2>&1
STATUS=$?

# Detect if cluster is running at all on local node
# TODO: detect node already in standby and bypass this
if [ "${STATUS}" = 0 ]; then
    local_node="$(cman_tool status | grep -i 'Node[[:space:]]*name:' | sed -e 's/^.*Node\s*name:\s*\([^[:space:]]*\).*$/\1/i')"
    for local_resource in $(pcs status 2>/dev/null | grep "ocf::heartbeat:VirtualDomain.*${local_node}\\s*\$" | awk '{print $1}'); do
        pcs resource disable "${local_resource}"
    done
    # TODO: each resource disabling above may return without waiting for complete stop -
    # wait here for "no more resources active"? (but avoid endless loops)
    pcs cluster standby "${local_node}"
fi

# Shut down gracefully anyway at the end
/sbin/shutdown -h +0

------------------------------------------------------------------------------------------------------------------------------------
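
The second TODO in the script (waiting for the disabled resources to actually 
stop, without looping forever) could be handled with a bounded polling loop. A 
minimal sketch, assuming the same "pcs status" line format that the script above 
greps for; the helper names and the default timeout and interval values are 
hypothetical.

```shell
#!/bin/bash
# Sketch only: count_local_vms and wait_for_local_vms_stopped are
# hypothetical helper names, not part of pcs.

count_local_vms() {
    # $1: local node name. Prints how many VirtualDomain resources
    # "pcs status" still reports on a line ending with that node name.
    pcs status 2>/dev/null | grep -c "ocf::heartbeat:VirtualDomain.*${1}[[:space:]]*\$"
}

wait_for_local_vms_stopped() {
    # $1: node name, $2: timeout in seconds, $3: polling interval in seconds
    local node="$1" timeout="${2:-120}" interval="${3:-5}" waited=0
    while [ "$(count_local_vms "${node}")" -gt 0 ]; do
        # Give up once the timeout has elapsed, to avoid an endless loop.
        if [ "${waited}" -ge "${timeout}" ]; then
            return 1
        fi
        sleep "${interval}"
        waited=$((waited + interval))
    done
    return 0
}
```

In the script above this would be called as wait_for_local_vms_stopped 
"${local_node}" between the disable loop and the standby command; a non-zero 
return means the timeout expired with resources still active, and the caller 
has to decide whether to proceed with the shutdown anyway.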

Comments/suggestions/improvements are more than welcome.

Many thanks in advance.

Regards,
Giuseppe

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



--
  Giuseppe Ragusa
  giuseppe.rag...@fastmail.fm

