Yes sir, the slony resource problem is solved. I added another resource and gave it a colocation constraint with my cluster-ip, so my first slony script (resource) no longer influences the functioning of the second one. I have also added a cleanup command at the end of each slony resource, and I will leave it there.
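The constraints for the second resource look roughly like the ones already in my cib for the first script (slony-fail2 is the name used in the cleanup command in the script below; the lsb:slony_failover2 script name and the constraint ids are only illustrative):

primitive slony-fail2 lsb:slony_failover2 \
        meta target-role="Started"
colocation ip-with-slony2 inf: slony-fail2 vir-ip
order slony2-b4-ip inf: vir-ip slony-fail2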
My slony resource snippet:
___________________________________________
#!/bin/bash

logger $0 called with $1

HOSTNAME=`uname -n`
NODE1_HOST=192.168.10.129
NODE2_HOST=192.168.10.130
slony_USER=postgres
slony_PASSWORD=hcl123
DATABASE_NAME=test2
CLUSTER_NAME=cluster2
PRIMARY_NAME=node1
PORT=5432
NUM=1

#
# Returns 1 (TRUE) if the local database is the master
#
is_master () {
    export PGPASSWORD=$slony_PASSWORD
    RESULT=`psql $DATABASE_NAME -h 127.0.0.1 --user $slony_USER -q -t <<_EOF_
SELECT count(*) FROM _$CLUSTER_NAME.sl_set
 WHERE set_origin=_$CLUSTER_NAME.getlocalnodeid('_$CLUSTER_NAME');
_EOF_`
    return $RESULT;
}

case "$1" in
start)
    # start commands go here
    is_master; IS_MASTER=$?
    if [ $IS_MASTER -eq 1 ]; then
        # Already the master. Nothing to do here.
        echo "The local database is already the master"
        exit 0;
    fi
    if [ "$HOSTNAME" == "$PRIMARY_NAME" ]; then
        OLD_MASTER=2
        OLD_SLAVE=1
    else
        OLD_MASTER=1
        OLD_SLAVE=2
    fi
    if [ $NUM -eq 1 ]; then
        /usr/lib/postgresql/8.4/bin/slonik <<_EOF_
cluster name=$CLUSTER_NAME;
node 1 admin conninfo = 'dbname=$DATABASE_NAME host=$NODE1_HOST user=$slony_USER port=$PORT password=$slony_PASSWORD';
node 2 admin conninfo = 'dbname=$DATABASE_NAME host=$NODE2_HOST user=$slony_USER port=$PORT password=$slony_PASSWORD';
failover (id=$OLD_MASTER, backup node = $OLD_SLAVE);
_EOF_
        sleep 8s
        crm resource cleanup slony-fail2 node2
        echo "Done"
    fi
    #fi;
    ;;
stop)
    # stop commands go here
    ;;
status)
    # status commands go here
    # If LOCALHOST reports itself as the master then status is 0,
    # otherwise status is 3
    is_master; RESULT=$?
    if [ "$RESULT" -eq "1" ]; then
        echo "Local database is the master"
        exit 0
    else
        echo "Local database is a slave"
        exit 3
    fi
    ;;
esac
____________________________________________________________

But sir, I am still worried about the cluster-ip. I am not getting any error messages other than:

Feb 19 10:14:43 node2 NetworkManager: <info> (eth0): carrier now OFF (device state 1)

And when I run "# ip addr show" (I am on node2 now):
____________________________________
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN qlen 100
    link/ether 00:07:e9:2e:5c:26 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.130/24 brd 192.168.10.255 scope global eth0
    inet 192.168.10.10/24 brd 192.168.10.255 scope global secondary eth0
    inet6 fe80::207:e9ff:fe2e:5c26/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:07:e9:2e:5c:27 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.2/24 brd 192.168.1.255 scope global eth1
    inet6 fe80::207:e9ff:fe2e:5c27/64 scope link
       valid_lft forever preferred_lft forever
_____________________________________________

And when I connect the cable back, my ha-log shows:

Feb 19 10:23:29 node2 NetworkManager: <info> (eth0): carrier now ON (device state 1)

eth0 is my cluster-ip interface (192.168.10.10 is the cluster-ip), and eth1 is my heartbeat interface, connected to node1 via a cross-over cable.

Sir, how can I confirm that heartbeat-pacemaker is detecting this as a resource failure? Only then will fencing take place, no?
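One thing I am considering, so that the cluster itself reacts to the pulled cable rather than only logging the NetworkManager carrier message, is a pingd connectivity resource plus a location rule. A rough, untested sketch, assuming the gateway 192.168.10.1 answers pings from both nodes (the gateway address and the resource/constraint names are only illustrative):

primitive pingd ocf:pacemaker:pingd \
        params host_list="192.168.10.1" multiplier="100" \
        op monitor interval="15s" timeout="20s"
clone pingdclone pingd \
        meta globally-unique="false"
location vir-ip-needs-connectivity vir-ip \
        rule -inf: not_defined pingd or pingd lte 0

With something like that in place, vir-ip should be moved away from a node whose eth0 loses connectivity, and crm_mon would show the move.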
With lots of thanks and regards,
Jayakrishnan.L

On Fri, Feb 19, 2010 at 1:17 AM, Dejan Muhamedagic <deja...@fastmail.fm> wrote:
> Hi,
>
> On Thu, Feb 18, 2010 at 09:53:13PM +0530, Jayakrishnan wrote:
> > Sir,
> >
> > You got the point...... I thought that there is some mistakes in my
> > configurations after 2 weeks trying... If I enable stonith and manage to
> > shut down the failed machine, most of my problems will be solved.. Now I
> > feel much confident...
>
> Good.
>
> > But sir I need to clear the resource failures for my slony_failover script..
> > Because when the slony failover takes place it will give a warning message
> > stating
> >
> > Feb 16 14:50:01 node1 lrmd: [2477]: info: RA output: (slony-fail:start:stderr)
> > <stdin>:4: NOTICE: failedNode: set 1 has no other direct receivers - move now
> >
> > to stderr or stdout and this warning messages are treated as resource
> > failures by heartbeat-pacemaker.
>
> It is treated as a failure because the script exited with an
> error code. Whether that makes sense or not, I don't know, it's
> up to the script. This being a LSB init script, you should
> evaluate it and test it thoroughly: unfortunately they are
> usually not robust enough for use in clusters.
>
> > So If I want to add another script for second database failover, I am
> > afraid the first script may block the execution of the second.. Now I have
> > only one database replication for testing and the slony-failover script is
> > running last while failovers..
>
> You lost me. I guess that "add another script" means "another
> resource". Well, if the two are independent, then they can't
> influence each other (unless the stop action fails in which case
> the node is fenced).
>
> > And I still dont believe that I am chatting with the person who made the
> > "crm-cli".
>
> Didn't know that the cli is so famous.
>
> Thanks,
>
> Dejan
>
> > ---------- Forwarded message ----------
> > From: Dejan Muhamedagic <deja...@fastmail.fm>
> > Date: Thu, Feb 18, 2010 at 9:02 PM
> > Subject: Re: [Pacemaker] Need help!!! resources fail-over not taking place properly...
> > To: pacemaker@oss.clusterlabs.org
> >
> > Hi,
> >
> > On Thu, Feb 18, 2010 at 08:22:26PM +0530, Jayakrishnan wrote:
> > > Hello Dejan,
> > >
> > > First of all thank you very much for your reply. I found that one of my
> > > node is having the permission problem. There the permission of
> > > /var/lib/pengine file was set to "999:999" I am not sure how!!!!!!
> > > However i changed it...
> > >
> > > sir, when I pull out the interface cable i am getting only this log message:
> > >
> > > Feb 18 16:55:58 node2 NetworkManager: <info> (eth0): carrier now OFF (device state 1)
> > >
> > > And the resource ip is not moving any where at all. It is still there in
> > > the same machine... I acn view that the IP is still assigned to the eth0
> > > interface via "# ip addr show", even though the interface status is 'down.'.
> > > Is this the split-brain?? If so how can I clear it??
> >
> > With fencing (stonith). Please read some documentation available
> > here: http://clusterlabs.org/wiki/Documentation
> >
> > > Because of the on-fail=standy in pgsql part in my cib I am able to do a
> > > failover to another node when I manuallyu stop the postgres service in tha
> > > active machine. however even after restarting the postgres service via
> > > "/etc/init.d/postgresql-8.4 start" I have to run
> > > crm resource cleanup <pgclone>
> >
> > Yes, that's necessary.
> >
> > > to make the crm_mon or cluster identify that the service on.
> > > Till then It is showing as a failed action
> > >
> > > crm_mon snippet
> > > --------------------------------------------------------------------
> > > Last updated: Thu Feb 18 20:17:28 2010
> > > Stack: Heartbeat
> > > Current DC: node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105) - partition with quorum
> > > Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> > > 2 Nodes configured, unknown expected votes
> > > 3 Resources configured.
> > > ============
> > >
> > > Node node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105): standby (on-fail)
> > > Online: [ node1 ]
> > >
> > > vir-ip       (ocf::heartbeat:IPaddr2):       Started node1
> > > slony-fail   (lsb:slony_failover):           Started node1
> > > Clone Set: pgclone
> > >     Started: [ node1 ]
> > >     Stopped: [ pgsql:0 ]
> > >
> > > Failed actions:
> > >     pgsql:0_monitor_15000 (node=node2, call=33, rc=7, status=complete): not running
> > > --------------------------------------------------------------------------------
> > >
> > > Is there any way to run crm resource cleanup <resource> periodically??
> >
> > Why would you want to do that? Do you expect your resources to
> > fail regularly?
> >
> > > I dont know if there is any mistake in pgsql ocf script sir.. I have given
> > > all parameters correctly but its is giving an error "syntax error" all the
> > > time when I use it..
> >
> > Best to report such a case, it's either a configuration problem
> > (did you read its metadata) or perhaps a bug in the RA.
> >
> > Thanks,
> >
> > Dejan
> >
> > > I put the same meta attributes as for the current lsb
> > > as shown below...
> > >
> > > Please help me out... should I reinstall the nodes again??
> > >
> > > On Thu, Feb 18, 2010 at 6:50 PM, Dejan Muhamedagic <deja...@fastmail.fm> wrote:
> > > > Hi,
> > > >
> > > > On Thu, Feb 18, 2010 at 05:09:09PM +0530, Jayakrishnan wrote:
> > > > > sir,
> > > > >
> > > > > I have set up a two node cluster in Ubuntu 9.1. I have added a cluster-ip
> > > > > using ocf:heartbeat:IPaddr2, clonned lsb script "postgresql-8.4" and also
> > > > > added a manually created script for slony database replication.
> > > > >
> > > > > Now every thing works fine but I am not able to use the ocf resource
> > > > > scripts. I mean fail over is not taking place or else even resource is
> > > > > not even taking.
> > > > > My ha.cf file and cib configuration is attached with this mail
> > > > >
> > > > > My ha.cf file
> > > > >
> > > > > autojoin none
> > > > > keepalive 2
> > > > > deadtime 15
> > > > > warntime 5
> > > > > initdead 64
> > > > > udpport 694
> > > > > bcast eth0
> > > > > auto_failback off
> > > > > node node1
> > > > > node node2
> > > > > crm respawn
> > > > > use_logd yes
> > > > >
> > > > > My cib.xml configuration file in cli format:
> > > > >
> > > > > node $id="3952b93e-786c-47d4-8c2f-a882e3d3d105" node2 \
> > > > >     attributes standby="off"
> > > > > node $id="ac87f697-5b44-4720-a8af-12a6f2295930" node1 \
> > > > >     attributes standby="off"
> > > > > primitive pgsql lsb:postgresql-8.4 \
> > > > >     meta target-role="Started" resource-stickness="inherited" \
> > > > >     op monitor interval="15s" timeout="25s" on-fail="standby"
> > > > > primitive slony-fail lsb:slony_failover \
> > > > >     meta target-role="Started"
> > > > > primitive vir-ip ocf:heartbeat:IPaddr2 \
> > > > >     params ip="192.168.10.10" nic="eth0" cidr_netmask="24" broadcast="192.168.10.255" \
> > > > >     op monitor interval="15s" timeout="25s" on-fail="standby" \
> > > > >     meta target-role="Started"
> > > > > clone pgclone pgsql \
> > > > >     meta notify="true" globally-unique="false" interleave="true" target-role="Started"
> > > > > colocation ip-with-slony inf: slony-fail vir-ip
> > > > > order slony-b4-ip inf: vir-ip slony-fail
> > > > > property $id="cib-bootstrap-options" \
> > > > >     dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
> > > > >     cluster-infrastructure="Heartbeat" \
> > > > >     no-quorum-policy="ignore" \
> > > > >     stonith-enabled="false" \
> > > > >     last-lrm-refresh="1266488780"
> > > > > rsc_defaults $id="rsc-options" \
> > > > >     resource-stickiness="INFINITY"
> > > > >
> > > > > I am assigning the cluster-ip (192.168.10.10) in eth0 with ip 192.168.10.129
> > > > > in one machine and 192.168.10.130 in another machine.
> > > > >
> > > > > When I pull out the eth0 interface cable fail-over is not taking place.
> > > >
> > > > That's split brain. More than a resource failure. Without
> > > > stonith, you'll have both nodes running all resources.
> > > >
> > > > > This is the log message i am getting while I pull out the cable:
> > > > >
> > > > > "Feb 18 16:55:58 node2 NetworkManager: <info> (eth0): carrier now OFF
> > > > > (device state 1)"
> > > > >
> > > > > and after a miniute or two
> > > > >
> > > > > log snippet:
> > > > > -------------------------------------------------------------------
> > > > > Feb 18 16:57:37 node2 cib: [21940]: info: cib_stats: Processed 3 operations
> > > > > (13333.00us average, 0% utilization) in the last 10min
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: crm_timer_popped: PEngine Recheck
> > > > > Timer (I_PE_CALC) just popped!
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State
> > > > > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
> > > > > origin=crm_timer_popped ]
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: WARN: do_state_transition: Progressed
> > > > > to state S_POLICY_ENGINE after C_TIMER_POPPED
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: All 2
> > > > > cluster nodes are eligible to run resources.
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke: Query 111:
> > > > > Requesting the current CIB: S_POLICY_ENGINE
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke_callback: Invoking
> > > > > the PE: ref=pe_calc-dc-1266492773-121, seq=2, quorate=1
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_config: On loss of
> > > > > CCM Quorum: Ignore
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_config: Node scores:
> > > > > 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node
> > > > > node2 is online
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op:
> > > > > slony-fail_monitor_0 on node2 returned 0 (ok) instead of the expected value:
> > > > > 7 (not running)
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation
> > > > > slony-fail_monitor_0 found resource slony-fail active on node2
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op:
> > > > > pgsql:0_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7
> > > > > (not running)
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation
> > > > > pgsql:0_monitor_0 found resource pgsql:0 active on node2
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node
> > > > > node1 is online
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print:
> > > > > vir-ip#011(ocf::heartbeat:IPaddr2):#011Started node2
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print:
> > > > > slony-fail#011(lsb:slony_failover):#011Started node2
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: clone_print: Clone Set:
> > > > > pgclone
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: print_list: #011Started: [
> > > > > node2 node1 ]
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: RecurringOp: Start
> > > > > recurring monitor (15s) for pgsql:1 on node1
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource
> > > > > vir-ip#011(Started node2)
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource
> > > > > slony-fail#011(Started node2)
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource
> > > > > pgsql:0#011(Started node2)
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource
> > > > > pgsql:1#011(Started node1)
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State
> > > > > transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> > > > > cause=C_IPC_MESSAGE origin=handle_response ]
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: unpack_graph: Unpacked transition
> > > > > 26: 1 actions in 1 synapses
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_te_invoke: Processing graph 26
> > > > > (ref=pe_calc-dc-1266492773-121) derived from
> > > > > /var/lib/pengine/pe-input-125.bz2
> > > > > Feb 18 17:02:53 node2 crmd: [21944]: info: te_rsc_command: Initiating action
> > > > > 15: monitor pgsql:1_monitor_15000 on node1
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: ERROR: write_last_sequence: Cannout
> > > > > open series file /var/lib/pengine/pe-input.last for writing
> > > >
> > > > This is probably a
> > > > permission problem. /var/lib/pengine should be
> > > > owned by haclient:hacluster.
> > > >
> > > > > Feb 18 17:02:53 node2 pengine: [21982]: info: process_pe_message: Transition
> > > > > 26: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: info: match_graph_event: Action
> > > > > pgsql:1_monitor_15000 (15) confirmed on node1 (rc=0)
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: info: run_graph:
> > > > > ====================================================
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: notice: run_graph: Transition 26
> > > > > (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> > > > > Source=/var/lib/pengine/pe-input-125.bz2): Complete
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: info: te_graph_trigger: Transition 26
> > > > > is now complete
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: info: notify_crmd: Transition 26
> > > > > status: done - <null>
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: State
> > > > > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> > > > > cause=C_FSA_INTERNAL origin=notify_crmd ]
> > > > > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: Starting
> > > > > PEngine Recheck Timer
> > > > >
> > > > > ------------------------------------------------------------------------------
> > > >
> > > > Don't see anything in the logs about the IP address resource.
> > > >
> > > > > Also I am not able to use the pgsql ocf script and hence I am using the init
> > > >
> > > > Why is that? Something wrong with pgsql? If so, then it should be
> > > > fixed. It's always much better to use the OCF instead of LSB RA.
> > > >
> > > > Thanks,
> > > >
> > > > Dejan
> > > >
> > > > > script and cloned it as I need to run it on both nodes for slony data base
> > > > > replication.
> > > > >
> > > > > I am using the heartbeat and pacemaker debs from the updated ubuntu karmic
> > > > > repo. (Heartbeat 2.99)
> > > > >
> > > > > Please check my configuration and tell me where I am missing....[?][?][?]
> > > > > --
> > > > > Regards,
> > > > >
> > > > > Jayakrishnan. L
> > > > >
> > > > > Visit: www.jayakrishnan.bravehost.com
> > > > >
> > > > > _______________________________________________
> > > > > Pacemaker mailing list
> > > > > Pacemaker@oss.clusterlabs.org
> > > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > > >
> > > > _______________________________________________
> > > > Pacemaker mailing list
> > > > Pacemaker@oss.clusterlabs.org
> > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > >
> > > --
> > > Regards,
> > >
> > > Jayakrishnan. L
> > >
> > > Visit: www.jayakrishnan.bravehost.com
> > >
> > > _______________________________________________
> > > Pacemaker mailing list
> > > Pacemaker@oss.clusterlabs.org
> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > _______________________________________________
> > Pacemaker mailing list
> > Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > --
> > Regards,
> >
> > Jayakrishnan. L
> > Visit: www.jayakrishnan.bravehost.com
> >
> > _______________________________________________
> > Pacemaker mailing list
> > Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker

--
Regards,

Jayakrishnan. L

Visit: www.jayakrishnan.bravehost.com
_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker