(I'm a bit confused because I received an auto-reply from pacemaker-boun...@oss.clusterlabs.org saying this list is now inactive, but I just received a digest with my mail. It happens that I have resent the email to the new list with a bit more information, which was missing in the first message. So here is that extra bit, anyway.)
I have also noticed this pattern (with both STONITH resources running):

1. With the cluster running without errors, I run "stop docker" on node cluster-a-1.
2. This leads the vCenter STONITH to act as expected.
3. After the cluster is running again without errors, I run "stop docker" on node cluster-a-1 again.
4. Now the vCenter STONITH doesn't run; instead, it is the IPMI STONITH that runs.

This is unexpected to me, as I was expecting the vCenter STONITH to run again.

On Wed, Apr 8, 2015 at 4:20 PM, Jorge Lopes <jmclo...@gmail.com> wrote:
> Hi all.
>
> I'm having difficulty orchestrating two STONITH devices in my cluster. I
> have been struggling with this in the past days and I need some help, please.
>
> A simplified version of my cluster and its goals is as follows:
> - The cluster has two physical servers, each hosting two nodes (VMware
> virtual machines): overall, there are 4 nodes in this simplified version.
> - There are two resource groups: group-cluster-a and group-cluster-b.
> - To achieve a good CPU balance across the physical servers, the cluster
> is asymmetric, with one group running on one server and the other group
> running on the other server.
> - If the VM of one host becomes unusable, then its resources are started
> on its sister VM deployed on the other physical host.
> - If one physical host becomes unusable, then all resources are started
> on the other physical host.
> - Two STONITH levels are used to fence the problematic nodes.
>
> The resources have the following behavior:
> - If the resource monitor detects a problem, then Pacemaker tries to
> restart the resource on the same node.
> - If that fails, then STONITH takes place (vCenter reboots the VM) and
> Pacemaker starts the resource on the sister VM present on the other
> physical host.
> - If restarting the VM fails, I want to power off the physical server,
> and Pacemaker will start all resources on the other physical host (a
> sketch of this intended ordering follows below).
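> To make that two-level ordering concrete, here is how I understand it
> would be expressed with a crm shell fencing_topology section (a sketch
> only, using the device names defined further down; I have not applied
> this yet):
>
> # Level 1: vCenter reboots the VM; level 2: IPMI acts on the host.
> fencing_topology \
>     cluster-a-1: stonith-vcenter-host1 stonith-ipmi-host1 \
>     cluster-a-2: stonith-vcenter-host1 stonith-ipmi-host1 \
>     cluster-b-1: stonith-vcenter-host2 stonith-ipmi-host2 \
>     cluster-b-2: stonith-vcenter-host2 stonith-ipmi-host2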
> The HA stack is:
> Ubuntu 14.04 (the node OS, which is a virtualized guest running on VMware
> ESXi 5.5)
> Pacemaker 1.1.12
> Corosync 2.3.4
> CRM 2.1.2
>
> The 4 nodes are:
> cluster-a-1
> cluster-a-2
> cluster-b-1
> cluster-b-2
>
> The relevant configuration is:
>
> property symmetric-cluster=false
> property stonith-enabled=true
> property no-quorum-policy=stop
>
> group group-cluster-a vip-cluster-a docker-web
> location loc-group-cluster-a-1 group-cluster-a inf: cluster-a-1
> location loc-group-cluster-a-2 group-cluster-a 500: cluster-a-2
>
> group group-cluster-b vip-cluster-b docker-srv
> location loc-group-cluster-b-1 group-cluster-b 500: cluster-b-1
> location loc-group-cluster-b-2 group-cluster-b inf: cluster-b-2
>
> # stonith vcenter definition for host 1
> # runs on any of the host2 VMs
> primitive stonith-vcenter-host1 stonith:external/vcenter \
>     params \
>         VI_SERVER="192.168.40.20" \
>         VI_CREDSTORE="/etc/vicredentials.xml" \
>         HOSTLIST="cluster-a-1=cluster-a-1;cluster-a-2=cluster-a-2" \
>         RESETPOWERON="1" \
>         priority="2" \
>         pcmk_host_check="static-list" \
>         pcmk_host_list="cluster-a-1 cluster-a-2" \
>     op monitor interval="60s"
>
> location loc1-stonith-vcenter-host1 stonith-vcenter-host1 500: cluster-b-1
> location loc2-stonith-vcenter-host1 stonith-vcenter-host1 501: cluster-b-2
>
> # stonith vcenter definition for host 2
> # runs on any of the host1 VMs
> primitive stonith-vcenter-host2 stonith:external/vcenter \
>     params \
>         VI_SERVER="192.168.40.21" \
>         VI_CREDSTORE="/etc/vicredentials.xml" \
>         HOSTLIST="cluster-b-1=cluster-b-1;cluster-b-2=cluster-b-2" \
>         RESETPOWERON="1" \
>         priority="2" \
>         pcmk_host_check="static-list" \
>         pcmk_host_list="cluster-b-1 cluster-b-2" \
>     op monitor interval="60s"
>
> location loc1-stonith-vcenter-host2 stonith-vcenter-host2 500: cluster-a-1
> location loc2-stonith-vcenter-host2 stonith-vcenter-host2 501: cluster-a-2
>
> # stonith IPMI definition for host 1 (DELL with iDRAC 7 enterprise
> # interface at 192.168.40.15)
> # runs on any of the host2 VMs
> primitive stonith-ipmi-host1 stonith:external/ipmi \
>     params hostname="host1" ipaddr="192.168.40.15" userid="root" \
>         passwd="mypassword" interface="lanplus" \
>         priority="1" \
>         pcmk_host_check="static-list" \
>         pcmk_host_list="cluster-a-1 cluster-a-2" \
>     op start interval="0" timeout="60s" requires="nothing" \
>     op monitor interval="3600s" timeout="20s" requires="nothing"
>
> location loc1-stonith-ipmi-host1 stonith-ipmi-host1 500: cluster-b-1
> location loc2-stonith-ipmi-host1 stonith-ipmi-host1 501: cluster-b-2
>
> # stonith IPMI definition for host 2 (DELL with iDRAC 7 enterprise
> # interface at 192.168.40.16)
> # runs on any of the host1 VMs
> primitive stonith-ipmi-host2 stonith:external/ipmi \
>     params hostname="host2" ipaddr="192.168.40.16" userid="root" \
>         passwd="mypassword" interface="lanplus" \
>         priority="1" \
>         pcmk_host_check="static-list" \
>         pcmk_host_list="cluster-b-1 cluster-b-2" \
>     op start interval="0" timeout="60s" requires="nothing" \
>     op monitor interval="3600s" timeout="20s" requires="nothing"
>
> location loc1-stonith-ipmi-host2 stonith-ipmi-host2 500: cluster-a-1
> location loc2-stonith-ipmi-host2 stonith-ipmi-host2 501: cluster-a-2
>
> What is working:
> - When an error is detected in one resource, the resource restarts on the
> same node, as expected.
> - With the STONITH external/ipmi resource *stopped*, a failure in one node
> makes vCenter reboot it, and the resources start on the sister node.
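> As an aside, for checking each fencing path on its own, my understanding
> is that a fencing action can be requested directly with stonith_admin
> (a sketch; run from a node that is allowed to run the device):
>
> # Ask the cluster to fence cluster-a-1; with both devices defined, the
> # device ranked first for that node should be tried first.
> stonith_admin --reboot cluster-a-1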
> What is not so good:
> - When vCenter reboots one node, the resources start on the other node as
> expected, but then they return to the original node as soon as it comes
> back online. This causes a bit of ping-pong, and I think it is a
> consequence of how the locations are defined. Any suggestion to avoid
> this? After a resource has been moved to another node, I would prefer that
> it stay there instead of returning to the original node. I can think of
> playing with the resource affinity scores - is that the way it should be
> done? (A sketch of what I mean follows at the end of this message.)
>
> What is wrong:
> Let's consider this scenario.
> I have a set of resources provided by a docker agent. My test consists of
> stopping the docker service on node cluster-a-1, which makes the docker
> agent return OCF_ERR_INSTALLED to Pacemaker (this is a change I made to
> the docker agent, compared to the GitHub repository version). With the
> IPMI STONITH resource stopped, this leads to a restart of node
> cluster-a-1, which is expected.
>
> But with the IPMI STONITH resource started, I notice erratic behavior:
> - Sometimes, the resources on node cluster-a-1 are stopped and no STONITH
> happens. Also, the resources are not moved to node cluster-a-2. In this
> situation, if I manually restart node cluster-a-1 (a virtual machine
> restart), then the IPMI STONITH takes place and restarts the
> corresponding physical server.
> - Sometimes, the IPMI STONITH starts before the vCenter STONITH, which is
> not expected because the vCenter STONITH has higher priority.
>
> I might have something wrong in my STONITH definitions, but I can't
> figure out what.
> Any idea how to correct this?
>
> And how can I set external/ipmi to power off the physical host, instead
> of rebooting it? (Again, a sketch of what I have in mind follows below.)
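> For the ping-pong question, the change I have in mind is a default
> resource stickiness, something like this (a sketch; the score of 1000 is
> arbitrary, and since my preferred-node locations use inf: scores, those
> would presumably have to become finite scores as well for stickiness to
> win):
>
> # Prefer staying put over moving back to a more-preferred node.
> rsc_defaults resource-stickiness=1000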
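> And for the power-off question, my understanding is that the generic
> fencing attribute pcmk_reboot_action can remap a reboot request into a
> power-off on the IPMI device, along these lines (a sketch of the host1
> device only):
>
> primitive stonith-ipmi-host1 stonith:external/ipmi \
>     params hostname="host1" ipaddr="192.168.40.15" userid="root" \
>         passwd="mypassword" interface="lanplus" \
>         pcmk_reboot_action="off" \
>     op start interval="0" timeout="60s" requires="nothing" \
>     op monitor interval="3600s" timeout="20s" requires="nothing"
>
> so that when Pacemaker asks this device to "reboot" a node, the agent is
> told to power it off instead.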