Dear list,

Please excuse the triviality: I have only just started deploying an HA environment in a test lab, so I do not have much experience.
I started setting up a 2-node cluster

corosync-1.4.1-15.el6.x86_64
pacemaker-1.1.8-7.el6.x86_64
cman-3.0.12.1-49.el6.x86_64

with RHEL 6.3 and then switched to RHEL 6.4. This update brings some differences: the crm shell is gone and pcs has been added. Anyway, I found equivalent commands to set up and configure resources. So far, so good.

I am doing some stress tests now and noticed that when I reboot one node (n2), that node is marked as standby in the CIB (as shown on the other node, n1). After the reboot, crm_mon on n2 shows the other node (n1) as offline, and n2 begins to start the resources, while n1, which was not rebooted, still shows n2 as standby. At that point both nodes are running the "same" resources. After a couple of minutes the situation is noticed and both nodes renegotiate the current state. Then one node takes over responsibility for providing the resources. On both nodes the previously rebooted node is still listed as standby.

cat /var/log/messages |grep error
Mar 4 17:32:33 cn1 pengine[1378]: error: native_create_actions: Resource resIP (ocf::IPaddr2) is active on 2 nodes attempting recovery
Mar 4 17:32:33 cn1 pengine[1378]: error: native_create_actions: Resource resApache (ocf::apache) is active on 2 nodes attempting recovery
Mar 4 17:32:33 cn1 pengine[1378]: error: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-error-6.bz2
Mar 4 17:32:48 cn1 crmd[1379]: notice: run_graph: Transition 1 (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-6.bz2): Complete

crm_mon -1
Last updated: Mon Mar 4 17:49:08 2013
Last change: Mon Mar 4 10:22:53 2013 via crm_resource on cn1.localdomain
Stack: cman
Current DC: cn1.localdomain - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 2 expected votes
2 Resources configured.

Node cn2.localdomain: standby
Online: [ cn1.localdomain ]

resIP (ocf::heartbeat:IPaddr2): Started cn1.localdomain
resApache (ocf::heartbeat:apache): Started cn1.localdomain

I checked the init scripts and found that the standby "behavior" comes from a function that is called on "service pacemaker stop" (added in RHEL 6.4):

cman_pre_stop()
{
    cname=`crm_node --name`
    crm_attribute -N $cname -n standby -v true -l reboot
    echo -n "Waiting for shutdown of managed resources"
...

I could not delete the standby attribute with

crm_attribute -G --node=cn2.localdomain -n standby

Okay, to recap:

1st. After rebooting there is a delay during which the two nodes do not see each other, and the result is resources running on both nodes when they should run on only one. The cluster corrects this itself, but the situation should not happen.

2nd. The standby attribute (and there must be a reason why Red Hat added this) prevents resources from migrating to that node. How do I delete this attribute?

I appreciate any comments.

--
Leon

A.
$ cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="HA" config_version="5">
  <logging debug="off"/>
  <clusternodes>
    <clusternode name="cn1.localdomain" votes="1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="cn1.localdomain"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="cn2.localdomain" votes="1" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="cn2.localdomain"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>

B.
$ pcs config
Corosync Nodes:

Pacemaker Nodes:
 cn1.localdomain cn2.localdomain

Resources:
 Resource: resIP (provider=heartbeat type=IPaddr2 class=ocf)
  Attributes: ip=192.168.201.220 nic=eth0 cidr_netmask=24
  Operations: monitor interval=30s
 Resource: resApache (provider=heartbeat type=apache class=ocf)
  Attributes: httpd=/usr/sbin/httpd configfile=/etc/httpd/conf/httpd.conf
  Operations: monitor interval=1min

Location Constraints:
Ordering Constraints:
  start resApache then start resIP
Colocation Constraints:
  resIP with resApache

Cluster Properties:
 dc-version: 1.1.8-7.el6-394e906
 cluster-infrastructure: cman
 expected-quorum-votes: 2
 stonith-enabled: false
 no-quorum-policy: ignore

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
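
A note on the standby attribute question, offered as an untested sketch rather than a confirmed fix: the -G option of crm_attribute only queries a value, while -D deletes it. Because the init script sets standby with "-l reboot" (a transient attribute), the delete presumably has to name the same lifetime. The helper names standby_delete_cmd and clear_standby and the DRY_RUN variable below are hypothetical, introduced only for illustration; by default the script just prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of pacemaker): print the crm_attribute call
# that would delete the transient standby attribute of the given node.
standby_delete_cmd() {
    node="$1"
    # -D deletes the attribute (-G, as tried in the post, only queries it).
    # -l reboot matches the lifetime used by cman_pre_stop() in the init script.
    echo "crm_attribute -N $node -n standby -l reboot -D"
}

# Hypothetical wrapper: with DRY_RUN=1 (the default here, an assumption made
# for safety) the command is only printed; with DRY_RUN=0 it is executed.
clear_standby() {
    cmd=$(standby_delete_cmd "$1")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        # Unquoted on purpose so the string splits into command and arguments;
        # this assumes the node name contains no whitespace.
        $cmd
    fi
}

clear_standby cn2.localdomain
```

To try it for real, the script would be run on a cluster node with DRY_RUN=0; checking the result with "crm_mon -1" afterwards should show whether cn2.localdomain is still listed as standby.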