Hi Andrew,

Thank you for comment.

> Perhaps I misunderstood - does the node fail _while_ we're running 
> post_notify_start_0?
> Is that the ordering you're talking about?

Yes.
I think that stonith do not have to wait for post_notify_start_0 of the 
inoperative node.

> If so, then the crmd is already supposed to be smart enough not to bother 
> waiting for those actions - perhaps the logic got broken at some point.

If you need detailed information, give me communication.

Best Regards,
Hideo Yamauchi.


--- On Tue, 2011/2/15, Andrew Beekhof <and...@beekhof.net> wrote:

> 
> On Feb 15, 2011, at 12:10 PM, renayama19661...@ybb.ne.jp wrote:
> 
> > Hi Andrew,
> > 
> > Thank you for comment.
> > 
> > Sorry...I may understand your opinion by mistake.
> > 
> >> Wait, what?
> >> Why would post_notify_start_0 of prmApPostgreSQLDB block stonith?
> > 
> > Yes.
> > I was able to see it like that.
> > post_notify_start_0 of crmd seems to keep processing waiting to time-out.
> > 
> > Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating action 
> > 76: notify prmApPostgreSQLDB:1_post_notify_start_0 on srv02
> > Feb 10 16:33:19 srv01 lrmd: [4290]: info: RA output: 
> > (prmApPostgreSQLDB:0:notify:stdout) usage: 
> > /usr/lib/ocf/resource.d//pacemaker/Stateful 
> > {start|stop|promote|demote|monitor|validate-all|meta-data}  Expects to have 
> > a fully populated OCF RA-compliant environment set. 
> > Feb 10 16:33:19 srv01 crmd: [4293]: info: process_lrm_event: LRM operation 
> > prmApPostgreSQLDB:0_notify_0 (call=24, rc=0, cib-update=101, 
> > confirmed=true) ok
> > Feb 10 16:33:19 srv01 crmd: [4293]: info: match_graph_event: Action 
> > prmApPostgreSQLDB:0_post_notify_start_0 (75) confirmed on srv01 (rc=0)
> > Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 
> > 40 fired and confirmed
> > Feb 10 16:34:39 srv01 crmd: [4293]: WARN: action_timer_callback: Timer 
> > popped (timeout=20000, abort_level=1000000, complete=false)
> > Feb 10 16:34:39 srv01 crmd: [4293]: ERROR: print_elem: Aborting transition, 
> > action lost: [Action 76]: Failed (id: 
> > prmApPostgreSQLDB:1_post_notify_start_0, loc: srv02, priority: 1000000)
> > Feb 10 16:34:39 srv01 crmd: [4293]: info: abort_transition_graph: 
> > action_timer_callback:486 - Triggered transition abort (complete=0) : 
> > Action lost
> > Feb 10 16:34:39 srv01 crmd: [4293]: WARN: cib_action_update: rsc_op 76: 
> > prmApPostgreSQLDB:1_post_notify_start_0 on srv02 timed out
> > 
> > 
> >> You didn't put a stonith resource in an ordering constraint did you?
> > I did not set order of stonith. 
> > We want to carry out STONITH without waiting for time-out of 
> > post_notify_start_0
> > Is there a method to solve this problem?
> 
> Perhaps I misunderstood - does the node fail _while_ we're running 
> post_notify_start_0?
> Is that the ordering you're talking about?
> 
> If so, then the crmd is already supposed to be smart enough not to bother 
> waiting for those actions - perhaps the logic got broken at some point.
> 
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > 
> > 
> > --- On Tue, 2011/2/15, Andrew Beekhof <and...@beekhof.net> wrote:
> > 
> >> On Tue, Feb 15, 2011 at 9:32 AM,  <renayama19661...@ybb.ne.jp> wrote:
> >>> Hi all,
> >>> 
> >>> We test trouble at the time of the start of the Master/Slave resource.
> >>> 
> >>> Step1) We start the first node and send cib.
> >>> 
> >>> ============
> >>> Last updated: Thu Feb 10 16:32:12 2011
> >>> Stack: Heartbeat
> >>> Current DC: srv01 (c7435833-8bc5-43aa-8195-c666b818677f) - partition with 
> >>> quorum
> >>> Version: 1.0.10-b0266dd5ffa9c51377c68b1f29d6bc84367f51dd
> >>> 1 Nodes configured, unknown expected votes
> >>> 5 Resources configured.
> >>> ============
> >>> 
> >>> Online: [ srv01 ]
> >>> 
> >>> prmIpPostgreSQLDB      (ocf::heartbeat:IPaddr2):       Started srv01
> >>> Resource Group: grpStonith2
> >>>     prmStonith2-2      (stonith:external/ssh): Started srv01
> >>>     prmStonith2-3      (stonith:meatware):     Started srv01
> >>> Master/Slave Set: msPostgreSQLDB
> >>>     Masters: [ srv01 ]
> >>>     Stopped: [ prmApPostgreSQLDB:1 ]
> >>> Clone Set: clnPingd
> >>>     Started: [ srv01 ]
> >>>     Stopped: [ prmPingd:1 ]
> >>> 
> >>> Migration summary:
> >>> * Node srv01:
> >>> 
> >>> 
> >>> Step2) We change Stateful of the node of the second of them.
> >>> (snip)
> >>> stateful_start() {
> >>>    ocf_log info "Start of Stateful."
> >>>    sleep 120                            ----> add sleep.
> >>>    stateful_check_state master
> >>> (snip)
> >>> 
> >>> Step3) We start a node of the second of them.
> >>> 
> >>> Step4) We confirm "sleep" and reboot a node of the second of them.
> >>> [root@srv02 ~]# ps -ef |grep sleep
> >>> 
> >>> Step5) The node of the first of them detects the disappearance of the 
> >>> node of the second of them.
> >>> * But, STONITH is delayed because "post_notify_start_0" of the node of 
> >>> the second of them is carried
> >>> out.
> >> 
> >> Wait, what?
> >> Why would post_notify_start_0 of prmApPostgreSQLDB block stonith?
> >> 
> >> You didn't put a stonith resource in an ordering constraint did you?
> >> 
> >> 
> >>> * In the srv02 node that disappeared, we do not need the practice of 
> >>> post_notify_start_0.
> >>> * STONITH should be carried out immediately.(STONITH is kept waiting to 
> >>> time-out of
> >>> post_notify_start_0 now.)
> >>> 
> >>> 
> >>> --------------(snip)----------------
> >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail: NEW 
> >>> MEMBERSHIP: trans=3, nodes=1, new=0,
> >>> lost=1 n_idx=0, new_idx=1, old_idx=3
> >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail:     CURRENT: 
> >>> srv01 [nodeid=0, born=3]
> >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail:     LOST:    
> >>> srv02 [nodeid=1, born=2]
> >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: ais_status_callback: status: 
> >>> srv02 is now lost (was member)
> >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: crm_update_peer: Node srv02: 
> >>> id=1 state=lost (new)
> >>> addr=(null) votes=-1 born=2 seen=2 proc=00000000000000000000000000000200
> >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: erase_node_from_join: Removed 
> >>> node srv02 from join
> >>> calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
> >>> Feb 10 16:33:18 srv01 crmd: [4293]: info: populate_cib_nodes_ha: 
> >>> Requesting the list of configured
> >>> nodes
> >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 
> >>> 36 fired and confirmed
> >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 
> >>> 39 fired and confirmed
> >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating 
> >>> action 75: notify
> >>> prmApPostgreSQLDB:0_post_notify_start_0 on srv01 (local)
> >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: do_lrm_rsc_op: Performing
> >>> key=75:7:0:6918f8dc-fe1a-4c28-8aff-e8ac7a5e7143 
> >>> op=prmApPostgreSQLDB:0_notify_0 )
> >>> Feb 10 16:33:19 srv01 lrmd: [4290]: info: rsc:prmApPostgreSQLDB:0:24: 
> >>> notify
> >>> Feb 10 16:33:19 srv01 cib: [4289]: info: cib_process_request: Operation 
> >>> complete: op cib_modify for
> >>> section nodes (origin=local/crmd/99, version=0.9.22): ok (rc=0)
> >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating 
> >>> action 76: notify
> >>> prmApPostgreSQLDB:1_post_notify_start_0 on srv02
> >>> Feb 10 16:33:19 srv01 lrmd: [4290]: info: RA output: 
> >>> (prmApPostgreSQLDB:0:notify:stdout) usage:
> >>> /usr/lib/ocf/resource.d//pacemaker/Stateful 
> >>> {start|stop|promote|demote|monitor|validate-all|meta-data}
> >>> Expects to have a fully populated OCF RA-compliant environment set.
> >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: process_lrm_event: LRM operation
> >>> prmApPostgreSQLDB:0_notify_0 (call=24, rc=0, cib-update=101, 
> >>> confirmed=true) ok
> >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: match_graph_event: Action
> >>> prmApPostgreSQLDB:0_post_notify_start_0 (75) confirmed on srv01 (rc=0)
> >>> Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 
> >>> 40 fired and confirmed
> >>> Feb 10 16:34:39 srv01 crmd: [4293]: WARN: action_timer_callback: Timer 
> >>> popped (timeout=20000,
> >>> abort_level=1000000, complete=false)
> >>> Feb 10 16:34:39 srv01 crmd: [4293]: ERROR: print_elem: Aborting 
> >>> transition, action lost: [Action 76]:
> >>> Failed (id: prmApPostgreSQLDB:1_post_notify_start_0, loc: srv02, 
> >>> priority: 1000000)
> >>> Feb 10 16:34:39 srv01 crmd: [4293]: info: abort_transition_graph: 
> >>> action_timer_callback:486 -
> >>> Triggered transition abort (complete=0) : Action lost
> >>> Feb 10 16:34:39 srv01 crmd: [4293]: WARN: cib_action_update: rsc_op 76:
> >>> prmApPostgreSQLDB:1_post_notify_start_0 on srv02 timed out
> >>> Feb 10 16:34:39 srv01 crmd: [4293]: info: run_graph:
> >>> ====================================================
> >>> Feb 10 16:34:39 srv01 crmd: [4293]: notice: run_graph: Transition 7 
> >>> (Complete=16, Pending=0, Fired=0,
> >>> Skipped=1, Incomplete=0, Source=/var/lib/pengine/pe-input-7.bz2): Stopped
> >>> Feb 10 16:34:39 srv01 crmd: [4293]: info: te_graph_trigger: Transition 7 
> >>> is now complete
> >>> Feb 10 16:34:39 srv01 crmd: [4293]: info: do_state_transition: State 
> >>> transition S_TRANSITION_ENGINE ->
> >>> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd 
> >>> ]
> >>> ---------------(snip)------------------
> >>> 
> >>> Will not this movement be bug?
> >>> 
> >>>  *  I registered the log with Bugzilla.
> >>>  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2559
> >>> 
> >>> Best Regards,
> >>> Hideo Yamauchi.
> >>> 
> >>> 
> >>> 
> >>> _______________________________________________
> >>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>> 
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: 
> >>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> >>> 
> >> 
> 
> 

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

Reply via email to