On 18 Jul 2014, at 12:35 pm, Emre He <emre...@gmail.com> wrote:

> Hi, 
> 
> I am working on a classic corosync+pacemaker Linux-HA cluster (2 servers). After 
> rebooting one server, when it comes back up, corosync is running but pacemaker is dead. 
> 
> In corosync.log, we can see the following: 
> --------------------------------------------------------
> Jul 17 03:56:04 [2068] foo.bar.com       crmd:     info: crmd_exit:   
> Dropping I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
> Jul 17 03:56:04 [2068] foo.bar.com       crmd:    debug: 
> lrm_state_verify_stopped:    Checking for active resources before exit
> Jul 17 03:56:04 [2068] foo.bar.com       crmd:     info: crmd_cs_destroy:     
> connection closed
> Jul 17 03:56:04 [2068] foo.bar.com       crmd:     info: crmd_init:   
> Inhibiting automated respawn
> Jul 17 03:56:04 [2068] foo.bar.com       crmd:     info: crmd_init:   2068 
> stopped: Network is down (100)

So this is on the node that is coming back up?
Perhaps the cluster is starting too early... do you use DHCP on this node?
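If it is DHCP, the interface may not have an address yet at the point where
corosync and pacemaker are started. As a rough sketch (this assumes EL6 SysV
init and that eth0 is the cluster interface), you could check the init
ordering and, if needed, wait for an address before starting the stack:

--------------------------------------------------------
# Verify that networking is set to start before the cluster stack
chkconfig --list | egrep 'network|corosync|pacemaker'

# Hypothetical workaround: block until the cluster interface has an
# IPv4 address before bringing the stack up (e.g. from rc.local or a
# wrapper script)
until ip -4 addr show dev eth0 | grep -q 'inet '; do
    sleep 1
done
service corosync start && service pacemaker start
--------------------------------------------------------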

> Jul 17 03:56:04 [2068] foo.bar.com       crmd:  warning: crmd_fast_exit:      
> Inhibiting respawn: 100 -> 100
> Jul 17 03:56:04 [2068] foo.bar.com       crmd:     info: crm_xml_cleanup:     
> Cleaning up memory from libxml2
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: 
> qb_ipcs_dispatch_connection_request:         HUP conn (2057-2068-14)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: qb_ipcs_disconnect:  
> qb_ipcs_disconnect(2057-2068-14) state:2
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:     info: crm_client_destroy:  
> Destroying 0 events
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: qb_rb_close:         
> Free'ing ringbuffer: /dev/shm/qb-pacemakerd-response-2057-2068-14-header
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: qb_rb_close:         
> Free'ing ringbuffer: /dev/shm/qb-pacemakerd-event-2057-2068-14-header
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: qb_rb_close:         
> Free'ing ringbuffer: /dev/shm/qb-pacemakerd-request-2057-2068-14-header
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    error: pcmk_child_exit:     
> Child process crmd (2068) exited: Network is down (100)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:  warning: pcmk_child_exit:     
> Pacemaker child process crmd no longer wishes to be respawned. Shutting 
> ourselves down.
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: 
> update_node_processes:       Node foo.bar.com now has process list: 
> 00000000000000000000000000111112 (was 00000000000000000000000000111312)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:   notice: 
> pcmk_shutdown_worker:        Shuting down Pacemaker
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: 
> pcmk_shutdown_worker:        crmd confirmed stopped
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:   notice: stop_child:  
> Stopping pengine: Sent -15 to process 2067
> Jul 17 03:56:04 [2067] foo.bar.com    pengine:     info: crm_signal_dispatch: 
>         Invoking handler for signal 15: Terminated
> Jul 17 03:56:04 [2067] foo.bar.com    pengine:     info: qb_ipcs_us_withdraw: 
>         withdrawing server sockets
> 
> 
> Jul 17 03:56:04 [2063] foo.bar.com        cib:    debug: qb_ipcs_unref:       
> qb_ipcs_unref() - destroying
> Jul 17 03:56:04 [2063] foo.bar.com        cib:     info: crm_xml_cleanup:     
> Cleaning up memory from libxml2
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:     info: pcmk_child_exit:     
> Child process cib (2063) exited: OK (0)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: 
> update_node_processes:       Node foo.bar.com now has process list: 
> 00000000000000000000000000000002 (was 00000000000000000000000000000102)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:  warning: qb_ipcs_event_sendv: 
>         new_event_notification (2057-2063-13): Broken pipe (32)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:    debug: 
> pcmk_shutdown_worker:        cib confirmed stopped
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:   notice: 
> pcmk_shutdown_worker:        Shutdown complete
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:   notice: 
> pcmk_shutdown_worker:        Attempting to inhibit respawning after fatal 
> error
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd:     info: crm_xml_cleanup:     
> Cleaning up memory from libxml2
> Jul 17 03:56:04 corosync [CPG   ] exit_fn for conn=0x17e3a20
> Jul 17 03:56:04 corosync [pcmk  ] WARN: route_ais_message: Sending message to 
> local.stonith-ng failed: ipc delivery failed (rc=-2)
> Jul 17 03:56:04 corosync [CPG   ] got procleave message from cluster node 
> 433183754
> Jul 17 03:56:07 corosync [pcmk  ] WARN: route_ais_message: Sending message to 
> local.cib failed: ipc delivery failed (rc=-2)
> Jul 17 03:56:19 corosync [pcmk  ] WARN: route_ais_message: Sending message to 
> local.stonith-ng failed: ipc delivery failed (rc=-2)
> Jul 17 03:56:19 corosync [pcmk  ] WARN: route_ais_message: Sending message to 
> local.stonith-ng failed: ipc delivery failed (rc=-2)
> --------------------------------------------------------
> 
> Here are my HA cluster parameters and package versions: 
> --------------------------------------------------------
> property cib-bootstrap-options: \
>         dc-version=1.1.10-1.el6_4.4-368c726 \
>         cluster-infrastructure="classic openais (with plugin)" \
>         expected-quorum-votes=2 \
>         stonith-enabled=false \
>         no-quorum-policy=ignore \
>         start-failure-is-fatal=false \
>         default-action-timeout=300s
> rsc_defaults rsc-options: \
>         resource-stickiness=100
> 
> 
> pacemaker-1.1.10-1.el6_4.4.x86_64
> corosync-1.4.1-15.el6_4.1.x86_64
> 
> --------------------------------------------------------
> 
> I am not sure whether the network had a brief disconnection; both servers are 
> VMware VMs, but the logs seem to show that it did. 
> So is an unexpected network issue the root cause? Actually, I had understood 
> that handling exactly that is what HA is for. 
> Or is there any other clue about the root cause? 
> 
> many thanks, 
> Emre
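
If there really was a transient network drop, corosync should have logged
membership changes around that time. A quick way to check (assuming your
logs go to the default EL6 location) would be something like:

--------------------------------------------------------
# Current ring status on each node
corosync-cfgtool -s

# Look for membership changes around the time of the reboot
grep -E 'TOTEM|pcmk_peer_update' /var/log/cluster/corosync.log
--------------------------------------------------------

Also note that crmd exited fatally ("Network is down (100)") and told
pacemakerd to inhibit respawning, so pacemaker will not come back by
itself; once the network is confirmed up, you have to start it again by
hand with "service pacemaker start".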