2008/7/31 jijun gao <[EMAIL PROTECTED]>:
> hi,
> I have two nodes, and when I start the heartbeat service only on the standby node,
> the resource keeps restarting itself. I don't know what is happening.
>
> Below is the resource information:
> [EMAIL PROTECTED] ~]# cibadmin -Q -o resources
> <resources>
>   <group id="group_1">
>     <primitive class="ocf" id="IPaddr_192_168_10_211" provider="heartbeat" type="IPaddr">
>       <operations>
>         <op id="IPaddr_192_168_10_211_mon" interval="1s" name="monitor" timeout="1s"/>

Very short interval and timeout.

>       </operations>
>       <instance_attributes id="IPaddr_192_168_10_211_inst_attr">
>         <attributes>
>           <nvpair id="IPaddr_192_168_10_211_attr_0" name="ip" value="192.168.10.211"/>
>         </attributes>
>       </instance_attributes>
>     </primitive>
>     <primitive class="lsb" id="asterisk_2" provider="heartbeat" type="asterisk">
>       <operations>
>         <op id="asterisk_2_mon" interval="3s" name="monitor" timeout="2s"/>

See above.

>       </operations>
>     </primitive>
>   </group>
> </resources>
>
> And here is part of the system log:
> Jul 31 16:24:37 node2 last message repeated 9 times
> Jul 31 16:24:37 node2 setroubleshoot: SELinux is preventing ifconfig (ifconfig_t) "read write" to socket:[136168] (initrc_t). For complete SELinux messages, run sealert -l 0db84664-2bd3-4f8f-a10e-1e0641417484

Hmmm ... I'm not familiar with SELinux, but that looks suspicious to me. I assume SELinux is disabled on node1?

> Jul 31 16:24:37 node2 lrmd: [29544]: WARN: asterisk_2:monitor process (PID 23374) timed out (try 1). Killing with signal SIGTERM (15).

... and because of the monitoring timeout the resource is declared dead and restarted.
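To spell that out: a 1s timeout on the IPaddr monitor (and 2s on the asterisk monitor) leaves almost no headroom, so any scheduling or I/O hiccup makes the check overrun, the LRM kills it, and the CRM recovers the resource. A more forgiving definition would look something like this — the 10s/20s values are illustrative, not a recommendation for your workload:

```xml
<operations>
  <!-- illustrative values: poll every 10s, allow up to 20s before
       the operation is declared failed -->
  <op id="IPaddr_192_168_10_211_mon" interval="10s" name="monitor" timeout="20s"/>
</operations>
```

The same change applies to asterisk_2_mon; an LSB status check that forks an init script can easily take longer than 2s on a loaded box.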
> Jul 31 16:24:37 node2 lrmd: [29544]: WARN: operation monitor[389] on ocf::IPaddr::IPaddr_192_168_10_211 for client 29547, its parameters: CRM_meta_interval=[1000] ip=[192.168.10.211] CRM_meta_id=[IPaddr_192_168_10_211_mon] CRM_meta_timeout=[1000] crm_feature_set=[2.0] CRM_meta_name=[monitor] : pid [23361] timed out
> Jul 31 16:24:37 node2 crmd: [29547]: ERROR: process_lrm_event: LRM operation IPaddr_192_168_10_211_monitor_1000 (389) Timed Out (timeout=1000ms)
> Jul 31 16:24:37 node2 tengine: [29549]: info: process_graph_event: Action IPaddr_192_168_10_211_monitor_1000 arrived after a completed transition
> Jul 31 16:24:37 node2 tengine: [29549]: info: update_abort_priority: Abort priority upgraded to 1000000
> Jul 31 16:24:37 node2 tengine: [29549]: WARN: update_failcount: Updating failcount for IPaddr_192_168_10_211 on cddecca4-8275-4913-83a7-8e7d3324cefc after failed monitor: rc=-2
> Jul 31 16:24:37 node2 crmd: [29547]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
> Jul 31 16:24:37 node2 crmd: [29547]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
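Note the update_failcount lines: every timed-out monitor bumps the resource's failcount, which the PEngine then factors into placement. You can inspect and reset it with crm_failcount (shipped with Heartbeat 2.x; double-check the flags against your release, as they have varied):

```
# Query the accumulated failcount for the IP resource on node2
crm_failcount -G -U node2 -r IPaddr_192_168_10_211

# Once the underlying timeout problem is fixed, reset the counter
crm_failcount -D -U node2 -r IPaddr_192_168_10_211
```

Resetting the counter without fixing the timeouts will only buy you a few seconds before the next monitor failure pushes it back up.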
> Jul 31 16:24:37 node2 pengine: [29550]: info: determine_online_status: Node node2 is online
> Jul 31 16:24:37 node2 pengine: [29550]: WARN: unpack_rsc_op: Processing failed op IPaddr_192_168_10_211_monitor_1000 on node2: Timed Out
> Jul 31 16:24:37 node2 pengine: [29550]: notice: group_print: Resource Group: group_1
> Jul 31 16:24:37 node2 pengine: [29550]: notice: native_print: IPaddr_192_168_10_211 (heartbeat::ocf:IPaddr): Started node2 FAILED
> Jul 31 16:24:37 node2 pengine: [29550]: notice: native_print: asterisk_2 (lsb:asterisk): Started node2
> Jul 31 16:24:37 node2 pengine: [29550]: notice: NoRoleChange: Recover resource IPaddr_192_168_10_211 (node2)
> Jul 31 16:24:37 node2 pengine: [29550]: notice: StopRsc: node2 Stop IPaddr_192_168_10_211
> Jul 31 16:24:37 node2 pengine: [29550]: notice: StartRsc: node2 Start IPaddr_192_168_10_211
> Jul 31 16:24:37 node2 pengine: [29550]: notice: RecurringOp: node2 IPaddr_192_168_10_211_monitor_1000
> Jul 31 16:24:37 node2 pengine: [29550]: notice: NoRoleChange: Leave resource asterisk_2 (node2)
> Jul 31 16:24:37 node2 crmd: [29547]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
> Jul 31 16:24:37 node2 lrmd: [29544]: WARN: operation monitor[391] on lsb::asterisk::asterisk_2 for client 29547, its parameters: CRM_meta_interval=[3000] CRM_meta_id=[asterisk_2_mon] CRM_meta_timeout=[2000] crm_feature_set=[2.0] CRM_meta_name=[monitor] : pid [23374] timed out
> Jul 31 16:24:37 node2 pengine: [29550]: info: process_pe_message: Transition 83: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-384.bz2
> Jul 31 16:24:37 node2 tengine: [29549]: info: unpack_graph: Unpacked transition 83: 11 actions in 11 synapses
> Jul 31 16:24:37 node2 crmd: [29547]: ERROR: process_lrm_event: LRM operation asterisk_2_monitor_3000 (391) Timed Out (timeout=2000ms)
> Jul 31 16:24:37 node2 tengine: [29549]: info: te_pseudo_action: Pseudo action 12 fired and confirmed
> Jul 31 16:24:37 node2 tengine: [29549]: info: send_rsc_command: Initiating action 8: asterisk_2_stop_0 on node2
> Jul 31 16:24:37 node2 tengine: [29549]: info: process_graph_event: Detected action asterisk_2_monitor_3000 from a different transition: 82 vs. 83
> Jul 31 16:24:37 node2 crmd: [29547]: info: do_lrm_rsc_op: Performing op=asterisk_2_stop_0 key=8:83:7f6166d0-2099-450e-81e0-3900a25ae8fd)
> Jul 31 16:24:38 node2 tengine: [29549]: info: update_abort_priority: Abort priority upgraded to 1000000
> Jul 31 16:24:38 node2 lrmd: [29544]: info: rsc:asterisk_2: stop
> Jul 31 16:24:38 node2 tengine: [29549]: info: update_abort_priority: Abort action 0 superceeded by 2
> Jul 31 16:24:38 node2 lrmd: [23380]: WARN: For LSB init script, no additional parameters are needed.
> Jul 31 16:24:38 node2 crmd: [29547]: info: process_lrm_event: LRM operation asterisk_2_monitor_3000 (call=391, rc=-2) Cancelled
> Jul 31 16:24:38 node2 tengine: [29549]: WARN: update_failcount: Updating failcount for asterisk_2 on cddecca4-8275-4913-83a7-8e7d3324cefc after failed monitor: rc=-2
> Jul 31 16:24:38 node2 lrmd: [29544]: info: RA output: (asterisk_2:stop:stdout) Shutting down asterisk:
> Jul 31 16:24:38 node2 lrmd: [29544]: info: RA output: (asterisk_2:stop:stdout) [
> Jul 31 16:24:38 node2 lrmd: [29544]: info: RA output: (asterisk_2:stop:stdout) 确定
> Jul 31 16:24:38 node2 lrmd: [29544]: info: RA output: (asterisk_2:stop:stdout) ]^M
> Jul 31 16:24:38 node2 lrmd: [29544]: info: RA output: (asterisk_2:stop:stdout)
> Jul 31 16:24:38 node2 crmd: [29547]: info: process_lrm_event: LRM operation asterisk_2_stop_0 (call=392, rc=0) complete
> Jul 31 16:24:38 node2 tengine: [29549]: info: match_graph_event: Action asterisk_2_stop_0 (8) confirmed on node2 (rc=0)
> Jul 31 16:24:38 node2 tengine: [29549]: info: run_graph: ====================================================
> Jul 31 16:24:38 node2 tengine: [29549]: notice: run_graph: Transition 83: (Complete=2, Pending=0, Fired=0, Skipped=9, Incomplete=0)
> Jul 31 16:24:38 node2 crmd: [29547]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
> Jul 31 16:24:38 node2 crmd: [29547]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
> Jul 31 16:24:38 node2 pengine: [29550]: info: determine_online_status: Node node2 is online
> Jul 31 16:24:38 node2 pengine: [29550]: WARN: unpack_rsc_op: Processing failed op IPaddr_192_168_10_211_monitor_1000 on node2: Timed Out
> Jul 31 16:24:38 node2 pengine: [29550]: notice: group_print: Resource Group: group_1
> Jul 31 16:24:38 node2 pengine: [29550]: notice: native_print: IPaddr_192_168_10_211 (heartbeat::ocf:IPaddr): Started node2 FAILED
> Jul 31 16:24:38 node2 pengine: [29550]: notice: native_print: asterisk_2 (lsb:asterisk): Stopped
> Jul 31 16:24:38 node2 pengine: [29550]: notice: NoRoleChange: Recover resource IPaddr_192_168_10_211 (node2)
> Jul 31 16:24:38 node2 pengine: [29550]: notice: StopRsc: node2 Stop IPaddr_192_168_10_211
>
> When I don't start heartbeat but start the asterisk service alone, it works fine, and when I start heartbeat on the primary node it works fine too.
> Thanks for reading such a long letter. Any ideas?
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
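So: raise the monitor intervals and timeouts first, then clear the recorded failures so the PEngine stops scheduling recoveries based on the stale history. One way to do the cleanup is crm_resource with --cleanup (-C); this is a sketch for your resource names, and the -H host flag is the Heartbeat 2.x spelling, so check your local man page:

```
# Wipe the failed-op history for both group members on node2
crm_resource -C -r IPaddr_192_168_10_211 -H node2
crm_resource -C -r asterisk_2 -H node2
```

It is also worth confirming whether SELinux is enforcing on node2 but not on node1 (the setroubleshoot message above suggests it is blocking ifconfig, which the IPaddr agent uses) — that alone could make the monitor hang until it times out.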
