I've run into many problems trying to run an RBD/NFS Pacemaker setup
like you describe on a two-node cluster. In my case, most of the
problems were a result of a) no quorum and b) no STONITH. If you're
going to be running this setup in production, I *highly* recommend
adding more nodes (if only to maintain quorum). Without quorum, you
will run into split-brain situations where you'll see anything from
both nodes starting the same services to configuration changes
disappearing when the nodes can't agree on the most recent revision.
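
For what it's worth, your log's "On loss of CCM Quorum: Ignore" line
means no-quorum-policy=ignore is set; that's the usual two-node
workaround, and it's exactly what lets node2 keep trying to run
resources after losing quorum. Once you have three or more voting
nodes, switch back to the safe default. A rough sketch with the crm
shell (untested against your stack):

    # See what the policy is set to now
    crm configure show | grep no-quorum-policy

    # Common two-node workaround: keep running without quorum
    # (risks split-brain, which is why fencing is mandatory)
    crm configure property no-quorum-policy=ignore

    # With 3+ voting nodes, restore the safe behavior: a partition
    # that loses quorum stops its resources
    crm configure property no-quorum-policy=stop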

In this case, it looks like a STONITH problem. Without a fencing
mechanism, node2 cannot be 100% certain that node1 is dead. When you
manually shut down corosync, I suspect node1 alerts the cluster that
it's leaving as part of a planned shutdown. When node1 just
disappears, node2 can't determine what happened; it just knows node1
isn't answering anymore. Node1 is not necessarily down; there could
just be a network problem between node1 and node2. In such a scenario,
it would be very bad for node2 to map/mount the RBDs and start writing
data while node1 is still providing the same service. Quorum can help
here too, but for node2 to be certain node1 is down, it needs an
out-of-band method to force node1 to power off; I use IPMI on a
separate physical network, for example.
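
For reference, IPMI fencing looks roughly like this with the crm
shell and the external/ipmi plugin. Every address, user, and password
below is a placeholder; run "stonith -t external/ipmi -n" to confirm
the parameter names your version expects:

    # One fencing resource per node; each must run on the other node
    crm configure primitive st-node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=192.168.60.71 userid=admin \
               passwd=secret interface=lan
    crm configure primitive st-node2 stonith:external/ipmi \
        params hostname=node2 ipaddr=192.168.60.72 userid=admin \
               passwd=secret interface=lan

    # A node can't be trusted to fence itself
    crm configure location l-st-node1 st-node1 -inf: node1
    crm configure location l-st-node2 st-node2 -inf: node2

    # Enable fencing cluster-wide
    crm configure property stonith-enabled=true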

So... add more nodes as quorum members. You can set location
constraints (negative scores) on the resource groups to prevent them
from running on these nodes if they are not as powerful as the nodes
you're using now (see the sketch below). Then, add a STONITH mechanism
on all the nodes, and verify it works.
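
A sketch of both pieces, assuming a hypothetical quorum-only third
node named node3:

    # Keep the heavy resource groups off the quorum-only node
    crm configure location loc-rbd-not-node3 g_rbd_share_1 -inf: node3
    crm configure location loc-nfs-not-node3 clo_nfs -inf: node3

    # Verify fencing by rebooting a node through the cluster and
    # confirming it actually power-cycles
    stonith_admin --reboot node1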

Once you do that, you should see things act the way you expect. Good luck!

-Steve

On 12/19/2015 03:46 AM, maoqi1982 wrote:
> Hi list,
> I have a test Ceph cluster of 3 nodes (node0: mon; node1: osd and
> NFS server 1; node2: osd and NFS server 2).
> OS: CentOS 6.6; kernel: 3.10.94-1.el6.elrepo.x86_64; Ceph version
> 0.94.5.
> I followed the
> http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/ instructions
> to set up an active/standby NFS environment.
> When I run "service corosync stop" or "poweroff" on node1, failover
> works fine (the NFS server is taken over by node2). But when I test
> cutting off node1's power, the failover fails.
> 1. [root@node1 ~]# crm status
> Last updated: Fri Dec 18 17:14:19 2015
> Last change: Fri Dec 18 17:13:29 2015
> Stack: classic openais (with plugin)
> Current DC: node1 - partition with quorum
> Version: 1.1.11-97629de
> 2 Nodes configured, 3 expected votes
> 8 Resources configured
> Online: [ node1 node2 ]
>  Resource Group: g_rbd_share_1
>      p_rbd_map_1        (ocf::ceph:rbd.in):     Started node1
>      p_fs_rbd_1 (ocf::heartbeat:Filesystem):    Started node1
>      p_export_rbd_1     (ocf::heartbeat:exportfs):      Started node1
>      p_vip_1    (ocf::heartbeat:IPaddr):        Started node1
>  Clone Set: clo_nfs [g_nfs]
>      Started: [ node1 node2 ]
> 2. [root@node1 ~]# service corosync stop
> [root@node2 cluster]# crm status
> Last updated: Fri Dec 18 17:14:59 2015
> Last change: Fri Dec 18 17:13:29 2015
> Stack: classic openais (with plugin)
> Current DC: node2 - partition WITHOUT quorum
> Version: 1.1.11-97629de
> 2 Nodes configured, 3 expected votes
> 8 Resources configured
> Online: [ node2 ]
> OFFLINE: [ node1 ]
>  Resource Group: g_rbd_share_1
>      p_rbd_map_1        (ocf::ceph:rbd.in):     Started node2
>      p_fs_rbd_1 (ocf::heartbeat:Filesystem):    Started node2
>      p_export_rbd_1     (ocf::heartbeat:exportfs):      Started node2
>      p_vip_1    (ocf::heartbeat:IPaddr):        Started node2
>  Clone Set: clo_nfs [g_nfs]
>      Started: [ node2 ]
>      Stopped: [ node1 ]
>
> 3. cut off node1 power manually
> [root@node2 cluster]# crm status
> Last updated: Fri Dec 18 17:23:06 2015
> Last change: Fri Dec 18 17:13:29 2015
> Stack: classic openais (with plugin)
> Current DC: node2 - partition WITHOUT quorum
> Version: 1.1.11-97629de
> 2 Nodes configured, 3 expected votes
> 8 Resources configured
> Online: [ node2 ]
> OFFLINE: [ node1 ]
>  Clone Set: clo_nfs [g_nfs]
>      Started: [ node2 ]
>      Stopped: [ node1 ]
> Failed actions:
>     p_rbd_map_1_start_0 on node2 'unknown error' (1): call=48,
> status=Timed Out, last-rc-change='Fri Dec 18 17:22:19 2015',
> queued=0ms, exec=20002ms
> corosync.log:
> Dec 18 17:22:19 corosync [pcmk  ] notice: pcmk_peer_update:
> Transitional membership event on ring 668: memb=1, new=0, lost=1
> Dec 18 17:22:19 corosync [pcmk  ] info: pcmk_peer_update: memb: node2
> 1211279552
> Dec 18 17:22:19 corosync [pcmk  ] info: pcmk_peer_update: lost: node1
> 1194502336
> Dec 18 17:22:19 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> membership event on ring 668: memb=1, new=0, lost=0
> Dec 18 17:22:19 corosync [pcmk  ] info: pcmk_peer_update: MEMB: node2
> 1211279552
> Dec 18 17:22:19 corosync [pcmk  ] info: ais_mark_unseen_peer_dead:
> Node node1 was not seen in the previous transition
> Dec 18 17:22:19 corosync [pcmk  ] info: update_member: Node
> 1194502336/node1 is now: lost
> Dec 18 17:22:19 [2695] node2       crmd:   notice:
> plugin_handle_membership:     Membership 668: quorum lost
> Dec 18 17:22:19 [2690] node2        cib:   notice:
> plugin_handle_membership:     Membership 668: quorum lost
> Dec 18 17:22:19 corosync [pcmk  ] info: send_member_notification:
> Sending membership update 668 to 3 children
> Dec 18 17:22:19 [2691] node2 stonith-ng:   notice:
> plugin_handle_membership:     Membership 668: quorum lost
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> crm_update_peer_proc:     plugin_handle_membership: Node
> node1[1194502336] - unknown is now lost
> Dec 18 17:22:19 [2690] node2        cib:     info:
> crm_update_peer_proc:     plugin_handle_membership: Node
> node1[1194502336] - unknown is now lost
> Dec 18 17:22:19 corosync [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> peer_update_callback:     Client node1/peer now has status [offline]
> (DC=true, changed=111312)
> Dec 18 17:22:19 [2691] node2 stonith-ng:     info:
> crm_update_peer_proc:     plugin_handle_membership: Node
> node1[1194502336] - unknown is now lost
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> peer_update_callback:     Peer node1 left us
> Dec 18 17:22:19 [2690] node2        cib:   notice:
> crm_update_peer_state:     plugin_handle_membership: Node
> node1[1194502336] - state is now lost (was member)
> Dec 18 17:22:19 [2695] node2       crmd:  warning: match_down_event:
>     No match for shutdown action on node1
> Dec 18 17:22:19 [2691] node2 stonith-ng:   notice:
> crm_update_peer_state:     plugin_handle_membership: Node
> node1[1194502336] - state is now lost (was member)
> Dec 18 17:22:19 [2695] node2       crmd:   notice:
> peer_update_callback:     Stonith/shutdown of node1 not matched
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> crm_update_peer_join:     peer_update_callback: Node node1[1194502336]
> - join-2 phase 4 -> 0
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> abort_transition_graph:     Transition aborted: Node failure
> (source=peer_update_callback:237, 1)
> Dec 18 17:22:19 [2695] node2       crmd:   notice:
> crm_update_peer_state:     plugin_handle_membership: Node
> node1[1194502336] - state is now lost (was member)
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> peer_update_callback:     node1 is now lost (was member)
> Dec 18 17:22:19 [2695] node2       crmd:  warning: match_down_event:
>     No match for shutdown action on node1
> Dec 18 17:22:19 corosync [CPG   ] chosen downlist: sender r(0)
> ip(192.168.50.72) ; members(old:2 left:1)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> status to master (origin=local/crmd/88)
> Dec 18 17:22:19 [2695] node2       crmd:   notice:
> peer_update_callback:     Stonith/shutdown of node1 not matched
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> abort_transition_graph:     Transition aborted: Node failure
> (source=peer_update_callback:237, 1)
> Dec 18 17:22:19 corosync [MAIN  ] Completed service synchronization,
> ready to provide service.
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> status to master (origin=local/crmd/89)
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> Diff: --- 0.69.157 2
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> Diff: +++ 0.69.158 (null)
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> +  /cib:  @num_updates=158
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> +  /cib/status/node_state[@id='node1']:  @crmd=offline,
> @crm-debug-origin=peer_update_callback
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/crmd/88, version=0.69.158)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/crmd/89, version=0.69.158)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> nodes to master (origin=local/crmd/90)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> status to master (origin=local/crmd/91)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> cib to master (origin=local/crmd/92)
> Dec 18 17:22:19 [2695] node2       crmd:     info: crmd_cs_dispatch:
>     Setting expected votes to 3
> Dec 18 17:22:19 [2695] node2       crmd:   notice:
> do_state_transition:     State transition S_IDLE -> S_INTEGRATION [
> input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> do_dc_join_offer_one:     An unknown node joined - (re-)offer to any
> unconfirmed nodes
> Dec 18 17:22:19 [2695] node2       crmd:     info: join_make_offer:
>     Making join offers based on membership 668
> Dec 18 17:22:19 [2695] node2       crmd:     info: join_make_offer:
>     Skipping node2: already known 4
> Dec 18 17:22:19 [2695] node2       crmd:     info: join_make_offer:
>     Not making an offer to node1: not active (lost)
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> abort_transition_graph:     Transition aborted: Peer Halt
> (source=do_te_invoke:158, 1)
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> do_state_transition:     State transition S_INTEGRATION ->
> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL
> origin=check_join_state ]
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> crmd_join_phase_log:     join-2: node2=confirmed
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> crmd_join_phase_log:     join-2: node1=none
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> do_state_transition:     State transition S_FINALIZE_JOIN ->
> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
> origin=peer_update_callback ]
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> do_dc_join_offer_one:     An unknown node joined - (re-)offer to any
> unconfirmed nodes
> Dec 18 17:22:19 [2695] node2       crmd:     info: join_make_offer:
>     Skipping node2: already known 4
> Dec 18 17:22:19 [2695] node2       crmd:     info: join_make_offer:
>     Not making an offer to node1: not active (lost)
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> do_state_transition:     State transition S_INTEGRATION ->
> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL
> origin=check_join_state ]
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> crmd_join_phase_log:     join-2: node2=confirmed
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> crmd_join_phase_log:     join-2: node1=none
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> do_state_transition:     State transition S_FINALIZE_JOIN ->
> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL
> origin=check_join_state ]
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> abort_transition_graph:     Transition aborted: Peer Cancelled
> (source=do_te_invoke:151, 1)
> Dec 18 17:22:19 [2693] node2      attrd:   notice:
> attrd_local_callback:     Sending full refresh (origin=crmd)
> Dec 18 17:22:19 [2693] node2      attrd:   notice:
> attrd_trigger_update:     Sending flush op to all hosts for: shutdown (0)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Completed cib_modify operation for section
> nodes: OK (rc=0, origin=node2/crmd/90, version=0.69.158)
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> Diff: --- 0.69.158 2
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> Diff: +++ 0.69.159 (null)
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> +  /cib:  @num_updates=159
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> +  /cib/status/node_state[@id='node2']: 
> @crm-debug-origin=post_cache_update
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> +  /cib/status/node_state[@id='node1']:  @in_ccm=false,
> @crm-debug-origin=post_cache_update
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/crmd/91, version=0.69.159)
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> Diff: --- 0.69.159 2
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> Diff: +++ 0.69.160 (null)
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> +  /cib:  @num_updates=160, @have-quorum=0
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Completed cib_modify operation for section
> cib: OK (rc=0, origin=node2/crmd/92, version=0.69.160)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> crm_config to master (origin=local/crmd/94)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> nodes to master (origin=local/crmd/95)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> status to master (origin=local/crmd/96)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> cib to master (origin=local/crmd/97)
> Dec 18 17:22:19 [2693] node2      attrd:   notice:
> attrd_trigger_update:     Sending flush op to all hosts for:
> probe_complete (true)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Completed cib_modify operation for section
> crm_config: OK (rc=0, origin=node2/crmd/94, version=0.69.160)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Completed cib_modify operation for section
> nodes: OK (rc=0, origin=node2/crmd/95, version=0.69.160)
> Dec 18 17:22:19 [2694] node2    pengine:   notice: unpack_config:    
> On loss of CCM Quorum: Ignore
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> Diff: --- 0.69.160 2
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> Diff: +++ 0.69.161 (null)
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> +  /cib:  @num_updates=161
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> +  /cib/status/node_state[@id='node2']: 
> @crm-debug-origin=do_state_transition
> Dec 18 17:22:19 [2690] node2        cib:     info: cib_perform_op:    
> +  /cib/status/node_state[@id='node1']: 
> @crm-debug-origin=do_state_transition, @join=down
> Dec 18 17:22:19 [2694] node2    pengine:     info:
> determine_online_status:     Node node2 is online
> Dec 18 17:22:19 [2694] node2    pengine:     info: group_print:    
>  Resource Group: g_rbd_share_1
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/crmd/96, version=0.69.161)
> Dec 18 17:22:19 [2694] node2    pengine:     info: native_print:    
>      p_rbd_map_1    (ocf::ceph:rbd.in):    Stopped
> Dec 18 17:22:19 [2694] node2    pengine:     info: native_print:    
>      p_fs_rbd_1    (ocf::heartbeat:Filesystem):    Stopped
> Dec 18 17:22:19 [2694] node2    pengine:     info: native_print:    
>      p_export_rbd_1    (ocf::heartbeat:exportfs):    Stopped
> Dec 18 17:22:19 [2694] node2    pengine:     info: native_print:    
>      p_vip_1    (ocf::heartbeat:IPaddr):    Stopped
> Dec 18 17:22:19 [2694] node2    pengine:     info: clone_print:    
>  Clone Set: clo_nfs [g_nfs]
> Dec 18 17:22:19 [2694] node2    pengine:     info: short_print:    
>      Started: [ node2 ]
> Dec 18 17:22:19 [2694] node2    pengine:     info: short_print:    
>      Stopped: [ node1 ]
> Dec 18 17:22:19 [2694] node2    pengine:     info: rsc_merge_weights:
>     p_rpcbind:1: Rolling back scores from p_nfs_server:1
> Dec 18 17:22:19 [2694] node2    pengine:     info: native_color:    
> Resource p_rpcbind:1 cannot run anywhere
> Dec 18 17:22:19 [2694] node2    pengine:     info: native_color:    
> Resource p_nfs_server:1 cannot run anywhere
> Dec 18 17:22:19 [2694] node2    pengine:     info: RecurringOp:    
>  Start recurring monitor (10s) for p_rbd_map_1 on node2
> Dec 18 17:22:19 [2694] node2    pengine:     info: RecurringOp:    
>  Start recurring monitor (20s) for p_fs_rbd_1 on node2
> Dec 18 17:22:19 [2694] node2    pengine:     info: RecurringOp:    
>  Start recurring monitor (10s) for p_export_rbd_1 on node2
> Dec 18 17:22:19 [2694] node2    pengine:     info: RecurringOp:    
>  Start recurring monitor (5s) for p_vip_1 on node2
> Dec 18 17:22:19 [2694] node2    pengine:   notice: LogActions:    
> Start   p_rbd_map_1    (node2)
> Dec 18 17:22:19 [2694] node2    pengine:   notice: LogActions:    
> Start   p_fs_rbd_1    (node2)
> Dec 18 17:22:19 [2694] node2    pengine:   notice: LogActions:    
> Start   p_export_rbd_1    (node2)
> Dec 18 17:22:19 [2694] node2    pengine:   notice: LogActions:    
> Start   p_vip_1    (node2)
> Dec 18 17:22:19 [2694] node2    pengine:     info: LogActions:    
> Leave   p_rpcbind:0    (Started node2)
> Dec 18 17:22:19 [2694] node2    pengine:     info: LogActions:    
> Leave   p_nfs_server:0    (Started node2)
> Dec 18 17:22:19 [2694] node2    pengine:     info: LogActions:    
> Leave   p_rpcbind:1    (Stopped)
> Dec 18 17:22:19 [2694] node2    pengine:     info: LogActions:    
> Leave   p_nfs_server:1    (Stopped)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Completed cib_modify operation for section
> cib: OK (rc=0, origin=node2/crmd/97, version=0.69.161)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> status to master (origin=local/attrd/24)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/attrd/24, version=0.69.161)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> status to master (origin=local/attrd/26)
> Dec 18 17:22:19 [2690] node2        cib:     info:
> cib_process_request:     Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/attrd/26, version=0.69.161)
> Dec 18 17:22:19 [2694] node2    pengine:   notice: process_pe_message:
>     Calculated Transition 3: /var/lib/pacemaker/pengine/pe-input-234.bz2
> Dec 18 17:22:19 [2695] node2       crmd:     info:
> do_state_transition:     State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response ]
> Dec 18 17:22:19 [2695] node2       crmd:     info: do_te_invoke:    
> Processing graph 3 (ref=pe_calc-dc-1450430539-52) derived from
> /var/lib/pacemaker/pengine/pe-input-234.bz2
> Dec 18 17:22:19 [2695] node2       crmd:   notice: te_rsc_command:    
> Initiating action 6: start p_rbd_map_1_start_0 on node2 (local)
> Dec 18 17:22:19 [2695] node2       crmd:     info: do_lrm_rsc_op:    
> Performing key=6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9
> op=p_rbd_map_1_start_0
> Dec 18 17:22:19 [2692] node2       lrmd:     info: log_execute:    
> executing - rsc:p_rbd_map_1 action:start call_id:48
> Dec 18 17:22:24 [2690] node2        cib:     info: cib_process_ping:
>     Reporting our current digest to node2:
> 29875b5ae8dcbbb1d07f904ca3423de2 for 0.69.161 (0x12607b0 0)
> Dec 18 17:22:39 [2692] node2       lrmd:  warning:
> child_timeout_callback:     p_rbd_map_1_start_0 process (PID 11010)
> timed out
> Dec 18 17:22:39 [2692] node2       lrmd:  warning: operation_finished:
>     p_rbd_map_1_start_0:11010 - timed out after 20000ms
> Dec 18 17:22:39 [2692] node2       lrmd:   notice: operation_finished:
>     p_rbd_map_1_start_0:11010:stderr [ libust[11019/11019]: Warning:
> HOME environment variable not set. Disabling LTTng-UST per-user
> tracing. (in setup_local_apps() at lttng-ust-comm.c:305) ]
> Dec 18 17:22:39 [2692] node2       lrmd:     info: log_finished:    
> finished - rsc:p_rbd_map_1 action:start call_id:48 pid:11010
> exit-code:1 exec-time:20002ms queue-time:0ms
> Dec 18 17:22:39 [2695] node2       crmd:     info:
> services_os_action_execute:     Managed rbd.in_meta-data_0 process
> 11117 exited with rc=0
> Dec 18 17:22:39 [2695] node2       crmd:    error: process_lrm_event:
>     Operation p_rbd_map_1_start_0: Timed Out (node=node2, call=48,
> timeout=20000ms)
> Dec 18 17:22:39 [2695] node2       crmd:   notice: process_lrm_event:
>     node2-p_rbd_map_1_start_0:48 [ libust[11019/11019]: Warning: HOME
> environment variable not set. Disabling LTTng-UST per-user tracing.
> (in setup_local_apps() at lttng-ust-comm.c:305)\n ]
> Dec 18 17:22:39 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> status to master (origin=local/crmd/99)
> Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:    
> Diff: --- 0.69.161 2
> Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:    
> Diff: +++ 0.69.162 (null)
> Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:    
> +  /cib:  @num_updates=162
> Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:    
> +  /cib/status/node_state[@id='node2']: 
> @crm-debug-origin=do_update_resource
> Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:    
> + 
> /cib/status/node_state[@id='node2']/lrm[@id='node2']/lrm_resources/lrm_resource[@id='p_rbd_map_1']/lrm_rsc_op[@id='p_rbd_map_1_last_0']:
>  
> @operation_key=p_rbd_map_1_start_0, @operation=start,
> @transition-key=6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9,
> @transition-magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9,
> @call-id=48, @rc-code=1, @op-status=2, @last-run=1450430539,
> @last-rc-change=1450430539, @exec-time=20002
> Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:    
> ++
> /cib/status/node_state[@id='node2']/lrm[@id='node2']/lrm_resources/lrm_resource[@id='p_rbd_map_1']:
>  
> <lrm_rsc_op id="p_rbd_map_1_last_failure_0"
> operation_key="p_rbd_map_1_start_0" operation="start"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> transition-key="6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9"
> transition-magic="2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9"
> call-id="48" rc-code="1" op-status="2" interval="0" l
> Dec 18 17:22:39 [2690] node2        cib:     info:
> cib_process_request:     Completed cib_modify operation for section
> status: OK (rc=0, origin=node2/crmd/99, version=0.69.162)
> Dec 18 17:22:39 [2695] node2       crmd:  warning: status_from_rc:    
> Action 6 (p_rbd_map_1_start_0) on node2 failed (target: 0 vs. rc: 1):
> Error
> Dec 18 17:22:39 [2695] node2       crmd:  warning: update_failcount:
>     Updating failcount for p_rbd_map_1 on node2 after failed start:
> rc=1 (update=INFINITY, time=1450430559)
> Dec 18 17:22:39 [2695] node2       crmd:   notice:
> abort_transition_graph:     Transition aborted by p_rbd_map_1_start_0
> 'modify' on node2: Event failed
> (magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, cib=0.69.162,
> source=match_graph_event:344, 0)
> Dec 18 17:22:39 [2695] node2       crmd:     info: match_graph_event:
>     Action p_rbd_map_1_start_0 (6) confirmed on node2 (rc=4)
> Dec 18 17:22:39 [2693] node2      attrd:   notice:
> attrd_trigger_update:     Sending flush op to all hosts for:
> fail-count-p_rbd_map_1 (INFINITY)
> Dec 18 17:22:39 [2695] node2       crmd:  warning: update_failcount:
>     Updating failcount for p_rbd_map_1 on node2 after failed start:
> rc=1 (update=INFINITY, time=1450430559)
> Dec 18 17:22:39 [2695] node2       crmd:     info:
> process_graph_event:     Detected action (3.6)
> p_rbd_map_1_start_0.48=unknown error: failed
> Dec 18 17:22:39 [2695] node2       crmd:  warning: status_from_rc:    
> Action 6 (p_rbd_map_1_start_0) on node2 failed (target: 0 vs. rc: 1):
> Error
> Dec 18 17:22:39 [2695] node2       crmd:  warning: update_failcount:
>     Updating failcount for p_rbd_map_1 on node2 after failed start:
> rc=1 (update=INFINITY, time=1450430559)
> Dec 18 17:22:39 [2695] node2       crmd:     info:
> abort_transition_graph:     Transition aborted by p_rbd_map_1_start_0
> 'create' on (null): Event failed
> (magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, cib=0.69.162,
> source=match_graph_event:344, 0)
> Dec 18 17:22:39 [2695] node2       crmd:     info: match_graph_event:
>     Action p_rbd_map_1_start_0 (6) confirmed on node2 (rc=4)
> Dec 18 17:22:39 [2695] node2       crmd:  warning: update_failcount:
>     Updating failcount for p_rbd_map_1 on node2 after failed start:
> rc=1 (update=INFINITY, time=1450430559)
> Dec 18 17:22:39 [2695] node2       crmd:     info:
> process_graph_event:     Detected action (3.6)
> p_rbd_map_1_start_0.48=unknown error: failed
> Dec 18 17:22:39 [2693] node2      attrd:   notice:
> attrd_perform_update:     Sent update 28: fail-count-p_rbd_map_1=INFINITY
> Dec 18 17:22:39 [2690] node2        cib:     info:
> cib_process_request:     Forwarding cib_modify operation for section
> status to master (origin=local/attrd/28)
> Dec 18 17:22:39 [2695] node2       crmd:   notice: run_graph:    
> Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=8, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-234.bz2): Stopped
> Dec 18 17:22:39 [2695] node2       crmd:     info:
> do_state_transition:     State transition S_TRANSITION_ENGINE ->
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=notify_crmd ]
> Dec 18 17:22:39 [2693] node2      attrd:   notice:
> attrd_trigger_update:     Sending flush op to all hosts for:
> last-failure-p_rbd_map_1 (1450430559)
> Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:    
> Diff: --- 0.69.162 2
> Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:    
> Diff: +++ 0.69.163 (null)
> Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:    
> +  /cib:  @num_updates=163
> Dec 18 17:22:39 [2690] node2        cib:     info: cib_perform_op:    
> ++
> /cib/status/node_state[@id='node2']/transient_attributes[@id='node2']/instance_attributes[@id='status-node2']:
>  
> <nvpair id="status-node2-fail-count-p_rbd_map_1"
> name="fail-count-p_rbd_map_1" value="INFINITY"/>
> .........
>
> thanks

-- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu
