I've run into many problems trying to run RBD/NFS under Pacemaker as you describe on a two-node cluster. In my case, most of the problems were the result of a) no quorum and b) no STONITH. If you're going to run this setup in production, I *highly* recommend adding more nodes (if only to maintain quorum). Without quorum, you will run into split-brain situations, where you'll see anything from both nodes starting the same services to configuration changes disappearing when the nodes can't agree on the most recent revision.
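For illustration only, a crm shell sketch of that kind of setup might look like the fragment below. The node names, IP addresses, and credentials are made up, and the external/ipmi parameter names can differ between cluster-glue versions, so check `stonith -t external/ipmi -n` on your own systems first.

```
# Fencing device that can power off node1 via IPMI on a separate
# management network (addresses/credentials are placeholders).
crm configure primitive p_stonith_node1 stonith:external/ipmi \
    params hostname=node1 ipaddr=10.0.0.11 userid=admin passwd=secret interface=lanplus \
    op monitor interval=60s
# Never let a node run its own fencing device.
crm configure location l_stonith_node1 p_stonith_node1 -inf: node1
# Repeat the two lines above for each node, then enable fencing.
crm configure property stonith-enabled=true

# Keep the NFS resource group off a quorum-only third node
# (here a hypothetical "node3").
crm configure location l_rbd_share_no_arbiter g_rbd_share_1 -inf: node3
```

Once configured, actually test it: fence a node deliberately (e.g. with `stonith_admin -F node1`, or by yanking its network) and confirm node2 only takes over after the fence reports success.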
In this case, it looks like a STONITH problem. Without a fencing mechanism, node2 cannot be 100% certain that node1 is dead. When you manually shut down corosync, I suspect that as part of the process, node1 alerts the cluster that it's leaving in a planned shutdown. When node1 just disappears, node2 can't determine what happened; it only knows that node1 isn't answering anymore. Node1 is not necessarily down; there could simply be a network problem between node1 and node2. In such a scenario, it would be very bad for node2 to map/mount the RBDs and start writing data while node1 is still providing the same service. Quorum can help here too, but for node2 to be certain node1 is down, it needs an out-of-band method to force node1 to power off; e.g., I use IPMI on a separate physical network.

So: add more nodes as quorum members. You can set weight restrictions on the resource groups to prevent them from running on these nodes if they are not as powerful as the nodes you're using now. Then add a STONITH mechanism on all the nodes, and verify that it works. Once you do that, you should see things act the way you expect.

Good luck!

-Steve

On 12/19/2015 03:46 AM, maoqi1982 wrote:
> Hi list,
> I have a test ceph cluster of 3 nodes (node0: mon; node1: osd and
> nfs server1; node2: osd and nfs server2).
> OS: CentOS 6.6; kernel: 3.10.94-1.el6.elrepo.x86_64; ceph version 0.94.5.
> I followed the instructions at
> http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/
> to set up an active/standby NFS environment.
> When using "service corosync stop" or "poweroff" on node1,
> the failover went fine (the NFS service was taken over by node2).
> But when I tested cutting off node1's power, the failover failed.
> 1.
> [root@node1 ~]# crm status
> Last updated: Fri Dec 18 17:14:19 2015
> Last change: Fri Dec 18 17:13:29 2015
> Stack: classic openais (with plugin)
> Current DC: node1 - partition with quorum
> Version: 1.1.11-97629de
> 2 Nodes configured, 3 expected votes
> 8 Resources configured
> Online: [ node1 node2 ]
> Resource Group: g_rbd_share_1
> p_rbd_map_1 (ocf::ceph:rbd.in): Started node1
> p_fs_rbd_1 (ocf::heartbeat:Filesystem): Started node1
> p_export_rbd_1 (ocf::heartbeat:exportfs): Started node1
> p_vip_1 (ocf::heartbeat:IPaddr): Started node1
> Clone Set: clo_nfs [g_nfs]
> Started: [ node1 node2 ]
>
> 2. [root@node1 ~]# service corosync stop
> [root@node2 cluster]# crm status
> Last updated: Fri Dec 18 17:14:59 2015
> Last change: Fri Dec 18 17:13:29 2015
> Stack: classic openais (with plugin)
> Current DC: node2 - partition WITHOUT quorum
> Version: 1.1.11-97629de
> 2 Nodes configured, 3 expected votes
> 8 Resources configured
> Online: [ node2 ]
> OFFLINE: [ node1 ]
> Resource Group: g_rbd_share_1
> p_rbd_map_1 (ocf::ceph:rbd.in): Started node2
> p_fs_rbd_1 (ocf::heartbeat:Filesystem): Started node2
> p_export_rbd_1 (ocf::heartbeat:exportfs): Started node2
> p_vip_1 (ocf::heartbeat:IPaddr): Started node2
> Clone Set: clo_nfs [g_nfs]
> Started: [ node2 ]
> Stopped: [ node1 ]
>
> 3. cut off node1 power manually
> [root@node2 cluster]# crm status
> Last updated: Fri Dec 18 17:23:06 2015
> Last change: Fri Dec 18 17:13:29 2015
> Stack: classic openais (with plugin)
> Current DC: node2 - partition WITHOUT quorum
> Version: 1.1.11-97629de
> 2 Nodes configured, 3 expected votes
> 8 Resources configured
> Online: [ node2 ]
> OFFLINE: [ node1 ]
> Clone Set: clo_nfs [g_nfs]
> Started: [ node2 ]
> Stopped: [ node1 ]
> Failed actions:
> p_rbd_map_1_start_0 on node2 'unknown error' (1): call=48,
> status=Timed Out, last-rc-change='Fri Dec 18 17:22:19 2015',
> queued=0ms, exec=20002ms
>
> corosync.log:
> Dec 18 17:22:19 corosync [pcmk ] notice: pcmk_peer_update:
> Transitional membership event on ring 668: memb=1, new=0, lost=1
> Dec 18 17:22:19 corosync [pcmk ] info: pcmk_peer_update: memb: node2
> 1211279552
> Dec 18 17:22:19 corosync [pcmk ] info: pcmk_peer_update: lost: node1
> 1194502336
> Dec 18 17:22:19 corosync [pcmk ] notice: pcmk_peer_update: Stable
> membership event on ring 668: memb=1, new=0, lost=0
> Dec 18 17:22:19 corosync [pcmk ] info: pcmk_peer_update: MEMB: node2
> 1211279552
> Dec 18 17:22:19 corosync [pcmk ] info: ais_mark_unseen_peer_dead:
> Node node1 was not seen in the previous transition
> Dec 18 17:22:19 corosync [pcmk ] info: update_member: Node
> 1194502336/node1 is now: lost
> Dec 18 17:22:19 [2695] node2 crmd: notice:
> plugin_handle_membership: Membership 668: quorum lost
> Dec 18 17:22:19 [2690] node2 cib: notice:
> plugin_handle_membership: Membership 668: quorum lost
> Dec 18 17:22:19 corosync [pcmk ] info: send_member_notification:
> Sending membership update 668 to 3 children
> Dec 18 17:22:19 [2691] node2 stonith-ng: notice:
> plugin_handle_membership: Membership 668: quorum lost
> Dec 18 17:22:19 [2695] node2 crmd: info:
> crm_update_peer_proc: plugin_handle_membership: Node
> node1[1194502336] - unknown is now lost
> Dec 18 17:22:19 [2690] node2 cib: info:
> crm_update_peer_proc: plugin_handle_membership: Node
> node1[1194502336] - unknown is now lost
> Dec 18 17:22:19 corosync [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> Dec 18 17:22:19 [2695] node2 crmd: info:
> peer_update_callback: Client node1/peer now has status [offline]
> (DC=true, changed=111312)
> Dec 18 17:22:19 [2691] node2 stonith-ng: info:
> crm_update_peer_proc: plugin_handle_membership: Node
> node1[1194502336] - unknown is now lost
> Dec 18 17:22:19 [2695] node2 crmd: info:
> peer_update_callback: Peer node1 left us
> Dec 18 17:22:19 [2690] node2 cib: notice:
> crm_update_peer_state: plugin_handle_membership: Node
> node1[1194502336] - state is now lost (was member)
> Dec 18 17:22:19 [2695] node2 crmd: warning: match_down_event:
> No match for shutdown action on node1
> Dec 18 17:22:19 [2691] node2 stonith-ng: notice:
> crm_update_peer_state: plugin_handle_membership: Node
> node1[1194502336] - state is now lost (was member)
> Dec 18 17:22:19 [2695] node2 crmd: notice:
> peer_update_callback: Stonith/shutdown of node1 not matched
> Dec 18 17:22:19 [2695] node2 crmd: info:
> crm_update_peer_join: peer_update_callback: Node node1[1194502336]
> - join-2 phase 4 -> 0
> Dec 18 17:22:19 [2695] node2 crmd: info:
> abort_transition_graph: Transition aborted: Node failure
> (source=peer_update_callback:237, 1)
> Dec 18 17:22:19 [2695] node2 crmd: notice:
> crm_update_peer_state: plugin_handle_membership: Node
> node1[1194502336] - state is now lost (was member)
> Dec 18 17:22:19 [2695] node2 crmd: info:
> peer_update_callback: node1 is now lost (was member)
> Dec 18 17:22:19 [2695] node2 crmd: warning: match_down_event:
> No match for shutdown action on node1
> Dec 18 17:22:19 corosync [CPG ] chosen downlist: sender r(0)
> ip(192.168.50.72) ; members(old:2 left:1)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> status to master (origin=local/crmd/88)
> Dec 18 17:22:19 [2695] node2 crmd: notice:
> peer_update_callback: Stonith/shutdown of node1 not matched
> Dec 18 17:22:19 [2695] node2 crmd: info:
> abort_transition_graph: Transition aborted: Node failure
> (source=peer_update_callback:237, 1)
> Dec 18 17:22:19 corosync [MAIN ] Completed service synchronization,
> ready to provide service.
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> status to master (origin=local/crmd/89)
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> Diff: --- 0.69.157 2
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> Diff: +++ 0.69.158 (null)
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> + /cib: @num_updates=158
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> + /cib/status/node_state[@id='node1']: @crmd=offline,
> @crm-debug-origin=peer_update_callback
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/crmd/88, version=0.69.158)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/crmd/89, version=0.69.158)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> nodes to master (origin=local/crmd/90)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> status to master (origin=local/crmd/91)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> cib to master (origin=local/crmd/92)
> Dec 18 17:22:19 [2695] node2 crmd: info: crmd_cs_dispatch:
> Setting expected votes to 3
> Dec 18 17:22:19 [2695] node2 crmd: notice:
> do_state_transition: State transition S_IDLE -> S_INTEGRATION [
> input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
> Dec 18 17:22:19 [2695] node2 crmd: info:
> do_dc_join_offer_one: An unknown node joined - (re-)offer to any
> unconfirmed nodes
> Dec 18 17:22:19 [2695] node2 crmd: info: join_make_offer:
> Making join offers based on membership 668
> Dec 18 17:22:19 [2695] node2 crmd: info: join_make_offer:
> Skipping node2: already known 4
> Dec 18 17:22:19 [2695] node2 crmd: info: join_make_offer:
> Not making an offer to node1: not active (lost)
> Dec 18 17:22:19 [2695] node2 crmd: info:
> abort_transition_graph: Transition aborted: Peer Halt
> (source=do_te_invoke:158, 1)
> Dec 18 17:22:19 [2695] node2 crmd: info:
> do_state_transition: State transition S_INTEGRATION ->
> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL
> origin=check_join_state ]
> Dec 18 17:22:19 [2695] node2 crmd: info:
> crmd_join_phase_log: join-2: node2=confirmed
> Dec 18 17:22:19 [2695] node2 crmd: info:
> crmd_join_phase_log: join-2: node1=none
> Dec 18 17:22:19 [2695] node2 crmd: info:
> do_state_transition: State transition S_FINALIZE_JOIN ->
> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
> origin=peer_update_callback ]
> Dec 18 17:22:19 [2695] node2 crmd: info:
> do_dc_join_offer_one: An unknown node joined - (re-)offer to any
> unconfirmed nodes
> Dec 18 17:22:19 [2695] node2 crmd: info: join_make_offer:
> Skipping node2: already known 4
> Dec 18 17:22:19 [2695] node2 crmd: info: join_make_offer:
> Not making an offer to node1: not active (lost)
> Dec 18 17:22:19 [2695] node2 crmd: info:
> do_state_transition: State transition S_INTEGRATION ->
> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL
> origin=check_join_state ]
> Dec 18 17:22:19 [2695] node2 crmd: info:
> crmd_join_phase_log: join-2: node2=confirmed
> Dec 18 17:22:19 [2695] node2 crmd: info:
> crmd_join_phase_log: join-2: node1=none
> Dec 18 17:22:19 [2695] node2 crmd: info:
> do_state_transition: State transition S_FINALIZE_JOIN ->
> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL
> origin=check_join_state ]
> Dec 18 17:22:19 [2695] node2 crmd: info:
> abort_transition_graph: Transition aborted: Peer Cancelled
> (source=do_te_invoke:151, 1)
> Dec 18 17:22:19 [2693] node2 attrd: notice:
> attrd_local_callback: Sending full refresh (origin=crmd)
> Dec 18 17:22:19 [2693] node2 attrd: notice:
> attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Completed cib_modify operation for section
> nodes: OK (rc=0, origin=node2/crmd/90, version=0.69.158)
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> Diff: --- 0.69.158 2
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> Diff: +++ 0.69.159 (null)
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> + /cib: @num_updates=159
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> + /cib/status/node_state[@id='node2']:
> @crm-debug-origin=post_cache_update
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> + /cib/status/node_state[@id='node1']: @in_ccm=false,
> @crm-debug-origin=post_cache_update
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/crmd/91, version=0.69.159)
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> Diff: --- 0.69.159 2
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> Diff: +++ 0.69.160 (null)
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> + /cib: @num_updates=160, @have-quorum=0
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Completed cib_modify operation for section
> cib: OK (rc=0, origin=node2/crmd/92, version=0.69.160)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> crm_config to master (origin=local/crmd/94)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> nodes to master (origin=local/crmd/95)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> status to master (origin=local/crmd/96)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> cib to master (origin=local/crmd/97)
> Dec 18 17:22:19 [2693] node2 attrd: notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> probe_complete (true)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Completed cib_modify operation for section
> crm_config: OK (rc=0, origin=node2/crmd/94, version=0.69.160)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Completed cib_modify operation for section
> nodes: OK (rc=0, origin=node2/crmd/95, version=0.69.160)
> Dec 18 17:22:19 [2694] node2 pengine: notice: unpack_config:
> On loss of CCM Quorum: Ignore
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> Diff: --- 0.69.160 2
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> Diff: +++ 0.69.161 (null)
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> + /cib: @num_updates=161
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> + /cib/status/node_state[@id='node2']:
> @crm-debug-origin=do_state_transition
> Dec 18 17:22:19 [2690] node2 cib: info: cib_perform_op:
> + /cib/status/node_state[@id='node1']:
> @crm-debug-origin=do_state_transition, @join=down
> Dec 18 17:22:19 [2694] node2 pengine: info:
> determine_online_status: Node node2 is online
> Dec 18 17:22:19 [2694] node2 pengine: info: group_print:
> Resource Group: g_rbd_share_1
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/crmd/96, version=0.69.161)
> Dec 18 17:22:19 [2694] node2 pengine: info: native_print:
> p_rbd_map_1 (ocf::ceph:rbd.in): Stopped
> Dec 18 17:22:19 [2694] node2 pengine: info: native_print:
> p_fs_rbd_1 (ocf::heartbeat:Filesystem): Stopped
> Dec 18 17:22:19 [2694] node2 pengine: info: native_print:
> p_export_rbd_1 (ocf::heartbeat:exportfs): Stopped
> Dec 18 17:22:19 [2694] node2 pengine: info: native_print:
> p_vip_1 (ocf::heartbeat:IPaddr): Stopped
> Dec 18 17:22:19 [2694] node2 pengine: info: clone_print:
> Clone Set: clo_nfs [g_nfs]
> Dec 18 17:22:19 [2694] node2 pengine: info: short_print:
> Started: [ node2 ]
> Dec 18 17:22:19 [2694] node2 pengine: info: short_print:
> Stopped: [ node1 ]
> Dec 18 17:22:19 [2694] node2 pengine: info: rsc_merge_weights:
> p_rpcbind:1: Rolling back scores from p_nfs_server:1
> Dec 18 17:22:19 [2694] node2 pengine: info: native_color:
> Resource p_rpcbind:1 cannot run anywhere
> Dec 18 17:22:19 [2694] node2 pengine: info: native_color:
> Resource p_nfs_server:1 cannot run anywhere
> Dec 18 17:22:19 [2694] node2 pengine: info: RecurringOp:
> Start recurring monitor (10s) for p_rbd_map_1 on node2
> Dec 18 17:22:19 [2694] node2 pengine: info: RecurringOp:
> Start recurring monitor (20s) for p_fs_rbd_1 on node2
> Dec 18 17:22:19 [2694] node2 pengine: info: RecurringOp:
> Start recurring monitor (10s) for p_export_rbd_1 on node2
> Dec 18 17:22:19 [2694] node2 pengine: info: RecurringOp:
> Start recurring monitor (5s) for p_vip_1 on node2
> Dec 18 17:22:19 [2694] node2 pengine: notice: LogActions:
> Start p_rbd_map_1 (node2)
> Dec 18 17:22:19 [2694] node2 pengine: notice: LogActions:
> Start p_fs_rbd_1 (node2)
> Dec 18 17:22:19 [2694] node2 pengine: notice: LogActions:
> Start p_export_rbd_1 (node2)
> Dec 18 17:22:19 [2694] node2 pengine: notice: LogActions:
> Start p_vip_1 (node2)
> Dec 18 17:22:19 [2694] node2 pengine: info: LogActions:
> Leave p_rpcbind:0 (Started node2)
> Dec 18 17:22:19 [2694] node2 pengine: info: LogActions:
> Leave p_nfs_server:0 (Started node2)
> Dec 18 17:22:19 [2694] node2 pengine: info: LogActions:
> Leave p_rpcbind:1 (Stopped)
> Dec 18 17:22:19 [2694] node2 pengine: info: LogActions:
> Leave p_nfs_server:1 (Stopped)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Completed cib_modify operation for section
> cib: OK (rc=0, origin=node2/crmd/97, version=0.69.161)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> status to master (origin=local/attrd/24)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/attrd/24, version=0.69.161)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> status to master (origin=local/attrd/26)
> Dec 18 17:22:19 [2690] node2 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/attrd/26, version=0.69.161)
> Dec 18 17:22:19 [2694] node2 pengine: notice: process_pe_message:
> Calculated Transition 3: /var/lib/pacemaker/pengine/pe-input-234.bz2
> Dec 18 17:22:19 [2695] node2 crmd: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response ]
> Dec 18 17:22:19 [2695] node2 crmd: info: do_te_invoke:
> Processing graph 3 (ref=pe_calc-dc-1450430539-52) derived from
> /var/lib/pacemaker/pengine/pe-input-234.bz2
> Dec 18 17:22:19 [2695] node2 crmd: notice: te_rsc_command:
> Initiating action 6: start p_rbd_map_1_start_0 on node2 (local)
> Dec 18 17:22:19 [2695] node2 crmd: info: do_lrm_rsc_op:
> Performing key=6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9
> op=p_rbd_map_1_start_0
> Dec 18 17:22:19 [2692] node2 lrmd: info: log_execute:
> executing - rsc:p_rbd_map_1 action:start call_id:48
> Dec 18 17:22:24 [2690] node2 cib: info: cib_process_ping:
> Reporting our current digest to node2:
> 29875b5ae8dcbbb1d07f904ca3423de2 for 0.69.161 (0x12607b0 0)
> Dec 18 17:22:39 [2692] node2 lrmd: warning:
> child_timeout_callback: p_rbd_map_1_start_0 process (PID 11010)
> timed out
> Dec 18 17:22:39 [2692] node2 lrmd: warning: operation_finished:
> p_rbd_map_1_start_0:11010 - timed out after 20000ms
> Dec 18 17:22:39 [2692] node2 lrmd: notice: operation_finished:
> p_rbd_map_1_start_0:11010:stderr [ libust[11019/11019]: Warning:
> HOME environment variable not set. Disabling LTTng-UST per-user
> tracing. (in setup_local_apps() at lttng-ust-comm.c:305) ]
> Dec 18 17:22:39 [2692] node2 lrmd: info: log_finished:
> finished - rsc:p_rbd_map_1 action:start call_id:48 pid:11010
> exit-code:1 exec-time:20002ms queue-time:0ms
> Dec 18 17:22:39 [2695] node2 crmd: info:
> services_os_action_execute: Managed rbd.in_meta-data_0 process
> 11117 exited with rc=0
> Dec 18 17:22:39 [2695] node2 crmd: error: process_lrm_event:
> Operation p_rbd_map_1_start_0: Timed Out (node=node2, call=48,
> timeout=20000ms)
>
> Dec 18 17:22:39 [2692] node2 lrmd: warning:
> child_timeout_callback: p_rbd_map_1_start_0 process (PID 11010)
> timed out
> Dec 18 17:22:39 [2692] node2 lrmd: warning: operation_finished:
> p_rbd_map_1_start_0:11010 - timed out after 20000ms
> Dec 18 17:22:39 [2692] node2 lrmd: notice: operation_finished:
> p_rbd_map_1_start_0:11010:stderr [ libust[11019/11019]: Warning:
> HOME environment variable not set. Disabling LTTng-UST per-user
> tracing. (in setup_local_apps() at lttng-ust-comm.c:305) ]
> Dec 18 17:22:39 [2692] node2 lrmd: info: log_finished:
> finished - rsc:p_rbd_map_1 action:start call_id:48 pid:11010
> exit-code:1 exec-time:20002ms queue-time:0ms
> Dec 18 17:22:39 [2695] node2 crmd: info:
> services_os_action_execute: Managed rbd.in_meta-data_0 process
> 11117 exited with rc=0
> Dec 18 17:22:39 [2695] node2 crmd: error: process_lrm_event:
> Operation p_rbd_map_1_start_0: Timed Out (node=node2, call=48,
> timeout=20000ms)
> Dec 18 17:22:39 [2695] node2 crmd: notice: process_lrm_event:
> node2-p_rbd_map_1_start_0:48 [ libust[11019/11019]: Warning: HOME
> environment variable not set. Disabling LTTng-UST per-user tracing.
> (in setup_local_apps() at lttng-ust-comm.c:305)\n ]
> Dec 18 17:22:39 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> status to master (origin=local/crmd/99)
> Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op:
> Diff: --- 0.69.161 2
> Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op:
> Diff: +++ 0.69.162 (null)
> Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op:
> + /cib: @num_updates=162
> Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op:
> + /cib/status/node_state[@id='node2']:
> @crm-debug-origin=do_update_resource
> Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op:
> +
> /cib/status/node_state[@id='node2']/lrm[@id='node2']/lrm_resources/lrm_resource[@id='p_rbd_map_1']/lrm_rsc_op[@id='p_rbd_map_1_last_0']:
> @operation_key=p_rbd_map_1_start_0, @operation=start,
> @transition-key=6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9,
> @transition-magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9,
> @call-id=48, @rc-code=1, @op-status=2, @last-run=1450430539,
> @last-rc-change=1450430539, @exec-time=20002
> Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op:
> ++
> /cib/status/node_state[@id='node2']/lrm[@id='node2']/lrm_resources/lrm_resource[@id='p_rbd_map_1']:
> <lrm_rsc_op id="p_rbd_map_1_last_failure_0"
> operation_key="p_rbd_map_1_start_0" operation="start"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> transition-key="6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9"
> transition-magic="2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9"
> call-id="48" rc-code="1" op-status="2" interval="0" l
> Dec 18 17:22:39 [2690] node2 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/crmd/99, version=0.69.162)
> Dec 18 17:22:39 [2695] node2 crmd: warning: status_from_rc:
> Action 6 (p_rbd_map_1_start_0) on node2 failed (target: 0 vs.
> rc: 1): Error
> Dec 18 17:22:39 [2695] node2 crmd: warning: update_failcount:
> Updating failcount for p_rbd_map_1 on node2 after failed start:
> rc=1 (update=INFINITY, time=1450430559)
> Dec 18 17:22:39 [2695] node2 crmd: notice:
> abort_transition_graph: Transition aborted by p_rbd_map_1_start_0
> 'modify' on node2: Event failed
> (magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, cib=0.69.162,
> source=match_graph_event:344, 0)
> Dec 18 17:22:39 [2695] node2 crmd: info: match_graph_event:
> Action p_rbd_map_1_start_0 (6) confirmed on node2 (rc=4)
> Dec 18 17:22:39 [2693] node2 attrd: notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-p_rbd_map_1 (INFINITY)
> Dec 18 17:22:39 [2695] node2 crmd: warning: update_failcount:
> Updating failcount for p_rbd_map_1 on node2 after failed start:
> rc=1 (update=INFINITY, time=1450430559)
> Dec 18 17:22:39 [2695] node2 crmd: info:
> process_graph_event: Detected action (3.6)
> p_rbd_map_1_start_0.48=unknown error: failed
> Dec 18 17:22:39 [2695] node2 crmd: warning: status_from_rc:
> Action 6 (p_rbd_map_1_start_0) on node2 failed (target: 0 vs.
> rc: 1): Error
> Dec 18 17:22:39 [2695] node2 crmd: warning: update_failcount:
> Updating failcount for p_rbd_map_1 on node2 after failed start:
> rc=1 (update=INFINITY, time=1450430559)
> Dec 18 17:22:39 [2695] node2 crmd: info:
> abort_transition_graph: Transition aborted by p_rbd_map_1_start_0
> 'create' on (null): Event failed
> (magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, cib=0.69.162,
> source=match_graph_event:344, 0)
> Dec 18 17:22:39 [2695] node2 crmd: info: match_graph_event:
> Action p_rbd_map_1_start_0 (6) confirmed on node2 (rc=4)
> Dec 18 17:22:39 [2695] node2 crmd: warning: update_failcount:
> Updating failcount for p_rbd_map_1 on node2 after failed start:
> rc=1 (update=INFINITY, time=1450430559)
> Dec 18 17:22:39 [2695] node2 crmd: info:
> process_graph_event: Detected action (3.6)
> p_rbd_map_1_start_0.48=unknown error: failed
> Dec 18 17:22:39 [2693] node2 attrd: notice:
> attrd_perform_update: Sent update 28: fail-count-p_rbd_map_1=INFINITY
> Dec 18 17:22:39 [2690] node2 cib: info:
> cib_process_request: Forwarding cib_modify operation for section
> status to master (origin=local/attrd/28)
> Dec 18 17:22:39 [2695] node2 crmd: notice: run_graph:
> Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=8, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-234.bz2): Stopped
> Dec 18 17:22:39 [2695] node2 crmd: info:
> do_state_transition: State transition S_TRANSITION_ENGINE ->
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=notify_crmd ]
> Dec 18 17:22:39 [2693] node2 attrd: notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> last-failure-p_rbd_map_1 (1450430559)
> Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op:
> Diff: --- 0.69.162 2
> Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op:
> Diff: +++ 0.69.163 (null)
> Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op:
> + /cib: @num_updates=163
> Dec 18 17:22:39 [2690] node2 cib: info: cib_perform_op:
> ++
> /cib/status/node_state[@id='node2']/transient_attributes[@id='node2']/instance_attributes[@id='status-node2']:
> <nvpair id="status-node2-fail-count-p_rbd_map_1"
> name="fail-count-p_rbd_map_1" value="INFINITY"/>
> .........
>
> thanks
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu