On Tue, Mar 5, 2013 at 4:20 AM, Leon Fauster <leonfaus...@googlemail.com> wrote:
> Dear list,
>
> just to excuse the triviality - I started to deploy an HA environment
> in a test lab and therefore I do not have much experience.
>
> I started to set up a 2-node cluster
>
>   corosync-1.4.1-15.el6.x86_64
>   pacemaker-1.1.8-7.el6.x86_64
>   cman-3.0.12.1-49.el6.x86_64
>
> with rhel6.3 and then switched to rhel6.4.
>
> This update brings some differences. The crm shell is gone and pcs is added.
> Anyway, I found some equivalent commands to set up/configure resources.
>
> So far all good. I am doing some stress tests now and noticed that when
> rebooting one node (n2), that node (n2) will be marked as standby in the
> CIB (shown on the other node (n1)).
>
> After rebooting the node (n2), crm_mon on that node shows that the other
> node (n1) is offline and begins to start the resources, while the other
> node (n1) that wasn't rebooted still shows n2 as standby. At that point
> both nodes are running the "same" resources. After a couple of minutes
> that situation is noticed and both nodes renegotiate the current state.
> Then one node takes over the responsibility to provide the resources.
> On both nodes the previously rebooted node is still listed as standby.
>
> cat /var/log/messages |grep error
> Mar 4 17:32:33 cn1 pengine[1378]:    error: native_create_actions: Resource resIP (ocf::IPaddr2) is active on 2 nodes attempting recovery
> Mar 4 17:32:33 cn1 pengine[1378]:    error: native_create_actions: Resource resApache (ocf::apache) is active on 2 nodes attempting recovery
> Mar 4 17:32:33 cn1 pengine[1378]:    error: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-error-6.bz2
> Mar 4 17:32:48 cn1 crmd[1379]:   notice: run_graph: Transition 1 (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-6.bz2): Complete
>
> crm_mon -1
> Last updated: Mon Mar 4 17:49:08 2013
> Last change: Mon Mar 4 10:22:53 2013 via crm_resource on cn1.localdomain
> Stack: cman
> Current DC: cn1.localdomain - partition with quorum
> Version: 1.1.8-7.el6-394e906
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
>
> Node cn2.localdomain: standby
> Online: [ cn1.localdomain ]
>
>  resIP      (ocf::heartbeat:IPaddr2):   Started cn1.localdomain
>  resApache  (ocf::heartbeat:apache):    Started cn1.localdomain
>
> I checked the init scripts and found that the standby "behavior" comes
> from a function that is called on "service pacemaker stop" (added in rhel6.4).
>
> cman_pre_stop()
> {
>     cname=`crm_node --name`
>     crm_attribute -N $cname -n standby -v true -l reboot
>     echo -n "Waiting for shutdown of managed resources"
>     ...

That will only last until the node comes back (the cluster will remove it
automatically); the core problem here is that it appears not to have. Can you
file a bug and attach a crm_report for the period covered by the restart?
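For reference, something along these lines should capture the right window.
This is only a sketch: the destination name is a placeholder and the times are
guesses taken from the log excerpt above, so widen them to be sure the whole
reboot is covered:

  crm_report -f "2013-03-04 17:00" -t "2013-03-04 18:00" standby-after-reboot

That should bundle the logs, the CIB and the pe-error-*.bz2 inputs from both
nodes, which is what we need to see why the transient attribute wasn't cleared.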
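On the manual cleanup you ask about further down: -G only queries the value,
and (if I recall the defaults correctly) crm_attribute looks at the permanent
("forever") node attributes unless told otherwise, while cman_pre_stop() sets
standby with a reboot lifetime, i.e. in the status section. So, as an untested
sketch, querying and deleting it with the matching lifetime should work:

  # query the transient (reboot-lifetime) copy of the attribute
  crm_attribute -N cn2.localdomain -n standby -l reboot -G
  # and remove it
  crm_attribute -N cn2.localdomain -n standby -l reboot -D

You shouldn't normally need this, though - clearing it by hand is just a
workaround while we figure out why the cluster didn't do it itself.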
> I could not delete the standby attribute with
>
>   crm_attribute -G --node=cn2.localdomain -n standby
>
> Okay - recap:
>
> 1st. I have this delay where the two nodes don't see each other (after
> rebooting), and the result is resources running on both nodes while they
> should only run on one node - this will be corrected by the cluster itself,
> but this situation should not happen.
>
> 2nd. The standby attribute (and there must be a reason why Red Hat added
> this) will prevent resources from migrating to that node. How do I delete
> this attribute?
>
> I appreciate any comments.
>
> --
> Leon
>
>
> A. $ cat /etc/cluster/cluster.conf
> <?xml version="1.0"?>
> <cluster name="HA" config_version="5">
>   <logging debug="off"/>
>   <clusternodes>
>     <clusternode name="cn1.localdomain" votes="1" nodeid="1">
>       <fence>
>         <method name="pcmk-redirect">
>           <device name="pcmk" port="cn1.localdomain"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="cn2.localdomain" votes="1" nodeid="2">
>       <fence>
>         <method name="pcmk-redirect">
>           <device name="pcmk" port="cn2.localdomain"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <fencedevices>
>     <fencedevice name="pcmk" agent="fence_pcmk"/>
>   </fencedevices>
>   <rm>
>     <failoverdomains/>
>     <resources/>
>   </rm>
> </cluster>
>
>
> B. $ pcs config
> Corosync Nodes:
>
> Pacemaker Nodes:
>  cn1.localdomain cn2.localdomain
>
> Resources:
>  Resource: resIP (provider=heartbeat type=IPaddr2 class=ocf)
>   Attributes: ip=192.168.201.220 nic=eth0 cidr_netmask=24
>   Operations: monitor interval=30s
>  Resource: resApache (provider=heartbeat type=apache class=ocf)
>   Attributes: httpd=/usr/sbin/httpd configfile=/etc/httpd/conf/httpd.conf
>   Operations: monitor interval=1min
>
> Location Constraints:
> Ordering Constraints:
>   start resApache then start resIP
> Colocation Constraints:
>   resIP with resApache
>
> Cluster Properties:
>  dc-version: 1.1.8-7.el6-394e906
>  cluster-infrastructure: cman
>  expected-quorum-votes: 2
>  stonith-enabled: false
>  no-quorum-policy: ignore

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org