Hi Andrew, Thank you for comment.
> Perhaps I misunderstood - does the node fail _while_ we're running > post_notify_start_0? > Is that the ordering you're talking about? Yes. I think that stonith do not have to wait for post_notify_start_0 of the inoperative node. > If so, then the crmd is already supposed to be smart enough not to bother > waiting for those actions - perhaps the logic got broken at some point. If you need detailed information, give me communication. Best Regards, Hideo Yamauchi. --- On Tue, 2011/2/15, Andrew Beekhof <and...@beekhof.net> wrote: > > On Feb 15, 2011, at 12:10 PM, renayama19661...@ybb.ne.jp wrote: > > > Hi Andrew, > > > > Thank you for comment. > > > > Sorry...I may understand your opinion by mistake. > > > >> Wait, what? > >> Why would post_notify_start_0 of prmApPostgreSQLDB block stonith? > > > > Yes. > > I was able to see it like that. > > post_notify_start_0 of crmd seems to keep processing waiting to time-out. > > > > Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating action > > 76: notify prmApPostgreSQLDB:1_post_notify_start_0 on srv02 > > Feb 10 16:33:19 srv01 lrmd: [4290]: info: RA output: > > (prmApPostgreSQLDB:0:notify:stdout) usage: > > /usr/lib/ocf/resource.d//pacemaker/Stateful > > {start|stop|promote|demote|monitor|validate-all|meta-data} Expects to have > > a fully populated OCF RA-compliant environment set. > > Feb 10 16:33:19 srv01 crmd: [4293]: info: process_lrm_event: LRM operation > > prmApPostgreSQLDB:0_notify_0 (call=24, rc=0, cib-update=101, > > confirmed=true) ok > > Feb 10 16:33:19 srv01 crmd: [4293]: info: match_graph_event: Action > > prmApPostgreSQLDB:0_post_notify_start_0 (75) confirmed on srv01 (rc=0) > > Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action > > 40 fired and confirmed > > Feb 10 16:34:39 srv01 crmd: [4293]: WARN: action_timer_callback: Timer > > popped (timeout=20000, abort_level=1000000, complete=false) > > Feb 10 16:34:39 srv01 crmd: [4293]: ERROR: print_elem: Aborting transition, > > action lost: [Action 76]: Failed (id: > > prmApPostgreSQLDB:1_post_notify_start_0, loc: srv02, priority: 1000000) > > Feb 10 16:34:39 srv01 crmd: [4293]: info: abort_transition_graph: > > action_timer_callback:486 - Triggered transition abort (complete=0) : > > Action lost > > Feb 10 16:34:39 srv01 crmd: [4293]: WARN: cib_action_update: rsc_op 76: > > prmApPostgreSQLDB:1_post_notify_start_0 on srv02 timed out > > > > > >> You didn't put a stonith resource in an ordering constraint did you? > > I did not set order of stonith. > > We want to carry out STONITH without waiting for time-out of > > post_notify_start_0 > > Is there a method to solve this problem? > > Perhaps I misunderstood - does the node fail _while_ we're running > post_notify_start_0? > Is that the ordering you're talking about? > > If so, then the crmd is already supposed to be smart enough not to bother > waiting for those actions - perhaps the logic got broken at some point. > > > > > Best Regards, > > Hideo Yamauchi. > > > > > > > > --- On Tue, 2011/2/15, Andrew Beekhof <and...@beekhof.net> wrote: > > > >> On Tue, Feb 15, 2011 at 9:32 AM, <renayama19661...@ybb.ne.jp> wrote: > >>> Hi all, > >>> > >>> We test trouble at the time of the start of the Master/Slave resource. > >>> > >>> Step1) We start the first node and send cib. > >>> > >>> ============ > >>> Last updated: Thu Feb 10 16:32:12 2011 > >>> Stack: Heartbeat > >>> Current DC: srv01 (c7435833-8bc5-43aa-8195-c666b818677f) - partition with > >>> quorum > >>> Version: 1.0.10-b0266dd5ffa9c51377c68b1f29d6bc84367f51dd > >>> 1 Nodes configured, unknown expected votes > >>> 5 Resources configured. > >>> ============ > >>> > >>> Online: [ srv01 ] > >>> > >>> prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started srv01 > >>> Resource Group: grpStonith2 > >>> prmStonith2-2 (stonith:external/ssh): Started srv01 > >>> prmStonith2-3 (stonith:meatware): Started srv01 > >>> Master/Slave Set: msPostgreSQLDB > >>> Masters: [ srv01 ] > >>> Stopped: [ prmApPostgreSQLDB:1 ] > >>> Clone Set: clnPingd > >>> Started: [ srv01 ] > >>> Stopped: [ prmPingd:1 ] > >>> > >>> Migration summary: > >>> * Node srv01: > >>> > >>> > >>> Step2) We change Stateful of the node of the second of them. > >>> (snip) > >>> stateful_start() { > >>> ocf_log info "Start of Stateful." > >>> sleep 120 ----> add sleep. > >>> stateful_check_state master > >>> (snip) > >>> > >>> Step3) We start a node of the second of them. > >>> > >>> Step4) We confirm "sleep" and reboot a node of the second of them. > >>> [root@srv02 ~]# ps -ef |grep sleep > >>> > >>> Step5) The node of the first of them detects the disappearance of the > >>> node of the second of them. > >>> * But, STONITH is delayed because "post_notify_start_0" of the node of > >>> the second of them is carried > >>> out. > >> > >> Wait, what? > >> Why would post_notify_start_0 of prmApPostgreSQLDB block stonith? > >> > >> You didn't put a stonith resource in an ordering constraint did you? > >> > >> > >>> * In the srv02 node that disappeared, we do not need the practice of > >>> post_notify_start_0. > >>> * STONITH should be carried out immediately.(STONITH is kept waiting to > >>> time-out of > >>> post_notify_start_0 now.) > >>> > >>> > >>> --------------(snip)---------------- > >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail: NEW > >>> MEMBERSHIP: trans=3, nodes=1, new=0, > >>> lost=1 n_idx=0, new_idx=1, old_idx=3 > >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail: CURRENT: > >>> srv01 [nodeid=0, born=3] > >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail: LOST: > >>> srv02 [nodeid=1, born=2] > >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: ais_status_callback: status: > >>> srv02 is now lost (was member) > >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: crm_update_peer: Node srv02: > >>> id=1 state=lost (new) > >>> addr=(null) votes=-1 born=2 seen=2 proc=00000000000000000000000000000200 > >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: erase_node_from_join: Removed > >>> node srv02 from join > >>> calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1 > >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: populate_cib_nodes_ha: > >>> Requesting the list of configured > >>> nodes > >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action > >>> 36 fired and confirmed > >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action > >>> 39 fired and confirmed > >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating > >>> action 75: notify > >>> prmApPostgreSQLDB:0_post_notify_start_0 on srv01 (local) > >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: do_lrm_rsc_op: Performing > >>> key=75:7:0:6918f8dc-fe1a-4c28-8aff-e8ac7a5e7143 > >>> op=prmApPostgreSQLDB:0_notify_0 ) > >>> Feb 10 16:33:19 srv01 lrmd: [4290]: info: rsc:prmApPostgreSQLDB:0:24: > >>> notify > >>> Feb 10 16:33:19 srv01 cib: [4289]: info: cib_process_request: Operation > >>> complete: op cib_modify for > >>> section nodes (origin=local/crmd/99, version=0.9.22): ok (rc=0) > >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating > >>> action 76: notify > >>> prmApPostgreSQLDB:1_post_notify_start_0 on srv02 > >>> Feb 10 16:33:19 srv01 lrmd: [4290]: info: RA output: > >>> (prmApPostgreSQLDB:0:notify:stdout) usage: > >>> /usr/lib/ocf/resource.d//pacemaker/Stateful > >>> {start|stop|promote|demote|monitor|validate-all|meta-data} > >>> Expects to have a fully populated OCF RA-compliant environment set. > >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: process_lrm_event: LRM operation > >>> prmApPostgreSQLDB:0_notify_0 (call=24, rc=0, cib-update=101, > >>> confirmed=true) ok > >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: match_graph_event: Action > >>> prmApPostgreSQLDB:0_post_notify_start_0 (75) confirmed on srv01 (rc=0) > >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action > >>> 40 fired and confirmed > >>> Feb 10 16:34:39 srv01 crmd: [4293]: WARN: action_timer_callback: Timer > >>> popped (timeout=20000, > >>> abort_level=1000000, complete=false) > >>> Feb 10 16:34:39 srv01 crmd: [4293]: ERROR: print_elem: Aborting > >>> transition, action lost: [Action 76]: > >>> Failed (id: prmApPostgreSQLDB:1_post_notify_start_0, loc: srv02, > >>> priority: 1000000) > >>> Feb 10 16:34:39 srv01 crmd: [4293]: info: abort_transition_graph: > >>> action_timer_callback:486 - > >>> Triggered transition abort (complete=0) : Action lost > >>> Feb 10 16:34:39 srv01 crmd: [4293]: WARN: cib_action_update: rsc_op 76: > >>> prmApPostgreSQLDB:1_post_notify_start_0 on srv02 timed out > >>> Feb 10 16:34:39 srv01 crmd: [4293]: info: run_graph: > >>> ==================================================== > >>> Feb 10 16:34:39 srv01 crmd: [4293]: notice: run_graph: Transition 7 > >>> (Complete=16, Pending=0, Fired=0, > >>> Skipped=1, Incomplete=0, Source=/var/lib/pengine/pe-input-7.bz2): Stopped > >>> Feb 10 16:34:39 srv01 crmd: [4293]: info: te_graph_trigger: Transition 7 > >>> is now complete > >>> Feb 10 16:34:39 srv01 crmd: [4293]: info: do_state_transition: State > >>> transition S_TRANSITION_ENGINE -> > >>> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd > >>> ] > >>> ---------------(snip)------------------ > >>> > >>> Will not this movement be bug? > >>> > >>> * I registered the log with Bugzilla. > >>> * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2559 > >>> > >>> Best Regards, > >>> Hideo Yamauchi. > >>> > >>> > >>> > >>> _______________________________________________ > >>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker > >>> > >>> Project Home: http://www.clusterlabs.org > >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > >>> Bugs: > >>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker > >>> > >> > > _______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker