Hello Dejan,

First of all, thank you very much for your reply. I found that one of my nodes had the permission problem: the ownership of /var/lib/pengine was set to "999:999". I am not sure how that happened, but I have changed it.
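For reference, this is roughly what I ran to put it back (assuming the owner should be the hacluster user and haclient group, per your note below about /var/lib/pengine; I checked the healthy node first to confirm):

    # on the healthy node, see what the ownership should be
    ls -ld /var/lib/pengine
    # on the affected node, apply the same ownership
    chown -R hacluster:haclient /var/lib/pengine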
Sir,

When I pull out the interface cable, the only log message I get is:

Feb 18 16:55:58 node2 NetworkManager: <info> (eth0): carrier now OFF (device state 1)

and the resource IP is not moving anywhere at all; it stays on the same machine. I can see that the IP is still assigned to the eth0 interface via "# ip addr show", even though the interface status is down. Is this the split-brain? If so, how can I clear it?

Because of on-fail="standby" on the pgsql monitor in my CIB, I am able to fail over to the other node when I manually stop the postgres service on the active machine. However, even after restarting the postgres service via "/etc/init.d/postgresql-8.4 start", I have to run "crm resource cleanup pgclone" before crm_mon (and the cluster) sees that the service is running again. Until then it shows up as a failed action.

crm_mon snippet
--------------------------------------------------------------------
Last updated: Thu Feb 18 20:17:28 2010
Stack: Heartbeat
Current DC: node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105) - partition with quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, unknown expected votes
3 Resources configured.
============

Node node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105): standby (on-fail)
Online: [ node1 ]

vir-ip          (ocf::heartbeat:IPaddr2):       Started node1
slony-fail      (lsb:slony_failover):           Started node1
Clone Set: pgclone
        Started: [ node1 ]
        Stopped: [ pgsql:0 ]

Failed actions:
    pgsql:0_monitor_15000 (node=node2, call=33, rc=7, status=complete): not running
--------------------------------------------------------------------------------

Is there any way to run "crm resource cleanup <resource>" periodically (see the sketch below)?

I don't know if there is any mistake in the pgsql OCF script, sir. I have given all the parameters correctly, but it gives a "syntax error" every time I use it. I put the same meta attributes as for the current lsb resource shown below. Please help me out. Should I reinstall the nodes again?
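On the cable-pull test: is the right way to confirm split brain simply to compare the cluster view from each node while the cable is out? This is how I have been checking, in case I am doing it wrong:

    # run on node1 and on node2 while the cable is pulled;
    # if each node shows itself as DC and all resources started locally,
    # the two nodes have lost each other
    crm_mon -1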
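On the periodic cleanup: if there is no built-in way, would a cron entry like the following be an acceptable workaround? This is only an idea, assuming crm is in /usr/sbin and that cleaning up pgclone every five minutes has no side effects:

    # /etc/cron.d/pgclone-cleanup (sketch only)
    */5 * * * * root /usr/sbin/crm resource cleanup pgclone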
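On the pgsql OCF resource: this is roughly the definition I have been trying in place of the current lsb primitive. The parameter values are just the paths from my Ubuntu 9.10 / PostgreSQL 8.4 install, so please tell me if any of them are wrong for the ocf:heartbeat:pgsql agent:

    primitive pgsql ocf:heartbeat:pgsql \
        params pgctl="/usr/lib/postgresql/8.4/bin/pg_ctl" pgdata="/var/lib/postgresql/8.4/main" pgdba="postgres" \
        op monitor interval="15s" timeout="25s" on-fail="standby" \
        meta target-role="Started"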
On Thu, Feb 18, 2010 at 6:50 PM, Dejan Muhamedagic <deja...@fastmail.fm> wrote:

> Hi,
>
> On Thu, Feb 18, 2010 at 05:09:09PM +0530, Jayakrishnan wrote:
> > sir,
> >
> > I have set up a two node cluster in Ubuntu 9.1. I have added a cluster-ip
> > using ocf:heartbeat:IPaddr2, clonned lsb script "postgresql-8.4" and also
> > added a manually created script for slony database replication.
> >
> > Now every thing works fine but I am not able to use the ocf resource
> > scripts. I mean fail over is not taking place or else even resource is not
> > even taking.
> >
> > My ha.cf file and cib configuration is attached with this mail
> >
> > My ha.cf file
> >
> > autojoin none
> > keepalive 2
> > deadtime 15
> > warntime 5
> > initdead 64
> > udpport 694
> > bcast eth0
> > auto_failback off
> > node node1
> > node node2
> > crm respawn
> > use_logd yes
> >
> > My cib.xml configuration file in cli format:
> >
> > node $id="3952b93e-786c-47d4-8c2f-a882e3d3d105" node2 \
> >     attributes standby="off"
> > node $id="ac87f697-5b44-4720-a8af-12a6f2295930" node1 \
> >     attributes standby="off"
> > primitive pgsql lsb:postgresql-8.4 \
> >     meta target-role="Started" resource-stickness="inherited" \
> >     op monitor interval="15s" timeout="25s" on-fail="standby"
> > primitive slony-fail lsb:slony_failover \
> >     meta target-role="Started"
> > primitive vir-ip ocf:heartbeat:IPaddr2 \
> >     params ip="192.168.10.10" nic="eth0" cidr_netmask="24" broadcast="192.168.10.255" \
> >     op monitor interval="15s" timeout="25s" on-fail="standby" \
> >     meta target-role="Started"
> > clone pgclone pgsql \
> >     meta notify="true" globally-unique="false" interleave="true" target-role="Started"
> > colocation ip-with-slony inf: slony-fail vir-ip
> > order slony-b4-ip inf: vir-ip slony-fail
> > property $id="cib-bootstrap-options" \
> >     dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
> >     cluster-infrastructure="Heartbeat" \
> >     no-quorum-policy="ignore" \
> >     stonith-enabled="false" \
> >     last-lrm-refresh="1266488780"
> > rsc_defaults $id="rsc-options" \
> >     resource-stickiness="INFINITY"
> >
> > I am assigning the cluster-ip (192.168.10.10) in eth0 with ip 192.168.10.129
> > in one machine and 192.168.10.130 in another machine.
> >
> > When I pull out the eth0 interface cable fail-over is not taking place.
>
> That's split brain. More than a resource failure. Without
> stonith, you'll have both nodes running all resources.
>
> > This is the log message i am getting while I pull out the cable:
> >
> > "Feb 18 16:55:58 node2 NetworkManager: <info> (eth0): carrier now OFF
> > (device state 1)"
> >
> > and after a miniute or two
> >
> > log snippet:
> > -------------------------------------------------------------------
> > Feb 18 16:57:37 node2 cib: [21940]: info: cib_stats: Processed 3 operations (13333.00us average, 0% utilization) in the last 10min
> > Feb 18 17:02:53 node2 crmd: [21944]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
> > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > Feb 18 17:02:53 node2 crmd: [21944]: WARN: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke: Query 111: Requesting the current CIB: S_POLICY_ENGINE
> > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke_callback: Invoking the PE: ref=pe_calc-dc-1266492773-121, seq=2, quorate=1
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node node2 is online
> > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op: slony-fail_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7 (not running)
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation slony-fail_monitor_0 found resource slony-fail active on node2
> > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op: pgsql:0_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7 (not running)
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation pgsql:0_monitor_0 found resource pgsql:0 active on node2
> > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node node1 is online
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print: vir-ip#011(ocf::heartbeat:IPaddr2):#011Started node2
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print: slony-fail#011(lsb:slony_failover):#011Started node2
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: clone_print: Clone Set: pgclone
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: print_list: #011Started: [ node2 node1 ]
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: RecurringOp: Start recurring monitor (15s) for pgsql:1 on node1
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource vir-ip#011(Started node2)
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource slony-fail#011(Started node2)
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource pgsql:0#011(Started node2)
> > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource pgsql:1#011(Started node1)
> > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > Feb 18 17:02:53 node2 crmd: [21944]: info: unpack_graph: Unpacked transition 26: 1 actions in 1 synapses
> > Feb 18 17:02:53 node2 crmd: [21944]: info: do_te_invoke: Processing graph 26 (ref=pe_calc-dc-1266492773-121) derived from /var/lib/pengine/pe-input-125.bz2
> > Feb 18 17:02:53 node2 crmd: [21944]: info: te_rsc_command: Initiating action 15: monitor pgsql:1_monitor_15000 on node1
> > Feb 18 17:02:53 node2 pengine: [21982]: ERROR: write_last_sequence: Cannout open series file /var/lib/pengine/pe-input.last for writing
>
> This is probably a permission problem. /var/lib/pengine should be
> owned by haclient:hacluster.
> > Feb 18 17:02:53 node2 pengine: [21982]: info: process_pe_message: Transition 26: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2
> > Feb 18 17:02:55 node2 crmd: [21944]: info: match_graph_event: Action pgsql:1_monitor_15000 (15) confirmed on node1 (rc=0)
> > Feb 18 17:02:55 node2 crmd: [21944]: info: run_graph: ====================================================
> > Feb 18 17:02:55 node2 crmd: [21944]: notice: run_graph: Transition 26 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-125.bz2): Complete
> > Feb 18 17:02:55 node2 crmd: [21944]: info: te_graph_trigger: Transition 26 is now complete
> > Feb 18 17:02:55 node2 crmd: [21944]: info: notify_crmd: Transition 26 status: done - <null>
> > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: Starting PEngine Recheck Timer
> > ------------------------------------------------------------------------------
>
> Don't see anything in the logs about the IP address resource.
>
> > Also I am not able to use the pgsql ocf script and hence I am using the init
>
> Why is that? Something wrong with pgsql? If so, then it should be
> fixed. It's always much better to use the OCF instead of LSB RA.
>
> Thanks,
>
> Dejan
>
> > script and cloned it as I need to run it on both nodes for slony data base
> > replication.
> >
> > I am using the heartbeat and pacemaker debs from the updated ubuntu karmic
> > repo. (Heartbeat 2.99)
> >
> > Please check my configuration and tell me where I am missing....
> > --
> > Regards,
> >
> > Jayakrishnan. L
> >
> > Visit: www.jayakrishnan.bravehost.com

--
Regards,

Jayakrishnan. L

Visit: www.jayakrishnan.bravehost.com
_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker