[Pacemaker] [Problem]post_notify_start_0 is carried out in the node that disappeared.

renayama19661014 Tue, 15 Feb 2011 00:35:49 -0800

Hi all,

We test trouble at the time of the start of the Master/Slave resource.


Step1) We start the first node and send cib.

============
Last updated: Thu Feb 10 16:32:12 2011
Stack: Heartbeat
Current DC: srv01 (c7435833-8bc5-43aa-8195-c666b818677f) - partition with quorum
Version: 1.0.10-b0266dd5ffa9c51377c68b1f29d6bc84367f51dd
1 Nodes configured, unknown expected votes
5 Resources configured.
============

Online: [ srv01 ]

prmIpPostgreSQLDB      (ocf::heartbeat:IPaddr2):       Started srv01
Resource Group: grpStonith2
     prmStonith2-2      (stonith:external/ssh): Started srv01
     prmStonith2-3      (stonith:meatware):     Started srv01
Master/Slave Set: msPostgreSQLDB
     Masters: [ srv01 ]
     Stopped: [ prmApPostgreSQLDB:1 ]
Clone Set: clnPingd
     Started: [ srv01 ]
     Stopped: [ prmPingd:1 ]

Migration summary:
* Node srv01: 


Step2) We change Stateful of the node of the second of them.
(snip)
stateful_start() {
    ocf_log info "Start of Stateful."
    sleep 120                            ----> add sleep.
    stateful_check_state master
(snip)

Step3) We start a node of the second of them.

Step4) We confirm "sleep" and reboot a node of the second of them.
[root@srv02 ~]# ps -ef |grep sleep

Step5) The node of the first of them detects the disappearance of the node of 
the second of them.
* But, STONITH is delayed because "post_notify_start_0" of the node of the 
second of them is carried
out.
* In the srv02 node that disappeared, we do not need the practice of 
post_notify_start_0.
* STONITH should be carried out immediately.(STONITH is kept waiting to 
time-out of
post_notify_start_0 now.)


--------------(snip)----------------
Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail: NEW MEMBERSHIP: 
trans=3, nodes=1, new=0,
lost=1 n_idx=0, new_idx=1, old_idx=3
Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail:     CURRENT: srv01 
[nodeid=0, born=3]
Feb 10 16:33:18 srv01 crmd: [4293]: info: ccm_event_detail:     LOST:    srv02 
[nodeid=1, born=2]
Feb 10 16:33:18 srv01 crmd: [4293]: info: ais_status_callback: status: srv02 is 
now lost (was member)
Feb 10 16:33:18 srv01 crmd: [4293]: info: crm_update_peer: Node srv02: id=1 
state=lost (new)
addr=(null) votes=-1 born=2 seen=2 proc=00000000000000000000000000000200
Feb 10 16:33:18 srv01 crmd: [4293]: info: erase_node_from_join: Removed node 
srv02 from join
calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
Feb 10 16:33:18 srv01 crmd: [4293]: info: populate_cib_nodes_ha: Requesting the 
list of configured
nodes
Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 36 
fired and confirmed
Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 39 
fired and confirmed
Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating action 75: 
notify
prmApPostgreSQLDB:0_post_notify_start_0 on srv01 (local)
Feb 10 16:33:19 srv01 crmd: [4293]: info: do_lrm_rsc_op: Performing
key=75:7:0:6918f8dc-fe1a-4c28-8aff-e8ac7a5e7143 op=prmApPostgreSQLDB:0_notify_0 
)
Feb 10 16:33:19 srv01 lrmd: [4290]: info: rsc:prmApPostgreSQLDB:0:24: notify
Feb 10 16:33:19 srv01 cib: [4289]: info: cib_process_request: Operation 
complete: op cib_modify for
section nodes (origin=local/crmd/99, version=0.9.22): ok (rc=0)
Feb 10 16:33:19 srv01 crmd: [4293]: info: te_rsc_command: Initiating action 76: 
notify
prmApPostgreSQLDB:1_post_notify_start_0 on srv02
Feb 10 16:33:19 srv01 lrmd: [4290]: info: RA output: 
(prmApPostgreSQLDB:0:notify:stdout) usage:
/usr/lib/ocf/resource.d//pacemaker/Stateful 
{start|stop|promote|demote|monitor|validate-all|meta-data}
Expects to have a fully populated OCF RA-compliant environment set.
Feb 10 16:33:19 srv01 crmd: [4293]: info: process_lrm_event: LRM operation
prmApPostgreSQLDB:0_notify_0 (call=24, rc=0, cib-update=101, confirmed=true) ok
Feb 10 16:33:19 srv01 crmd: [4293]: info: match_graph_event: Action
prmApPostgreSQLDB:0_post_notify_start_0 (75) confirmed on srv01 (rc=0)
Feb 10 16:33:19 srv01 crmd: [4293]: info: te_pseudo_action: Pseudo action 40 
fired and confirmed
Feb 10 16:34:39 srv01 crmd: [4293]: WARN: action_timer_callback: Timer popped 
(timeout=20000,
abort_level=1000000, complete=false)
Feb 10 16:34:39 srv01 crmd: [4293]: ERROR: print_elem: Aborting transition, 
action lost: [Action 76]:
Failed (id: prmApPostgreSQLDB:1_post_notify_start_0, loc: srv02, priority: 
1000000)
Feb 10 16:34:39 srv01 crmd: [4293]: info: abort_transition_graph: 
action_timer_callback:486 -
Triggered transition abort (complete=0) : Action lost
Feb 10 16:34:39 srv01 crmd: [4293]: WARN: cib_action_update: rsc_op 76:
prmApPostgreSQLDB:1_post_notify_start_0 on srv02 timed out
Feb 10 16:34:39 srv01 crmd: [4293]: info: run_graph:
====================================================
Feb 10 16:34:39 srv01 crmd: [4293]: notice: run_graph: Transition 7 
(Complete=16, Pending=0, Fired=0,
Skipped=1, Incomplete=0, Source=/var/lib/pengine/pe-input-7.bz2): Stopped
Feb 10 16:34:39 srv01 crmd: [4293]: info: te_graph_trigger: Transition 7 is now 
complete
Feb 10 16:34:39 srv01 crmd: [4293]: info: do_state_transition: State transition 
S_TRANSITION_ENGINE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
---------------(snip)------------------

Will not this movement be bug?

 *  I registered the log with Bugzilla.
 * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2559

Best Regards,
Hideo Yamauchi.



_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

[Pacemaker] [Problem]post_notify_start_0 is carried out in the node that disappeared.

Reply via email to