On 22/05/2013, at 9:44 PM, Brian J. Murrell <br...@interlinx.bc.ca> wrote:
> Using pacemaker 1.1.8-7 on EL6, I got the following series of events
> trying to shut down pacemaker and then corosync. The corosync shutdown
> (service corosync stop) ended up spinning/hanging indefinitely (~7hrs
> now). The events, including a:
>
> May 21 23:47:18 node1 crmd[17598]: error: do_exit: Could not recover from internal error
>
> For completeness, I've included the logging from the whole session,
> from corosync startup until it all went haywire. The badness seems to
> start at about 23:43:18.
>
> May 21 23:42:51 node1 corosync[17541]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
> May 21 23:42:51 node1 corosync[17541]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
> May 21 23:42:51 node1 corosync[17541]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transport (UDP/IP Multicast).
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transport (UDP/IP Multicast).
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] The network interface [192.168.122.253] is now up.
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: process_ais_conf: Reading configure
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN

Can I take a brief moment to highlight those last two lines...

And to also refer you to http://blog.clusterlabs.org/blog/2013/pacemaker-on-rhel6-dot-4/ which talks about issues with fencing for plugin-based clusters.

> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_init: Local handle: 5650605097994944513 for logging
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_next: Processing additional logging options...
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found 'off' for option: debug
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found 'no' for option: to_logfile
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_init: Local handle: 2730409743423111170 for quorum
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_next: No additional configuration supplied for: quorum
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: No default for option: provider
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_init: Local handle: 5880381755227111427 for service
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_next: Processing additional service options...
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found '1' for option: ver
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: process_ais_conf: Enabling MCP mode: Use the Pacemaker init script to complete Pacemaker startup
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_startup: CRM: Initialized
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] Logging: Initialized pcmk_startup
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_startup: Service: 9
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_startup: Local hostname: node1
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_update_nodeid: Local node id: 4252674240
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: Creating entry for node 4252674240 born on 0
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c4640 Node 4252674240 now known as node1 (was: (null))
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: Node node1 now has 1 quorum votes (was 0)
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: Node 4252674240/node1 is now: member
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.8
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync configuration service
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync profile loading service
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> May 21 23:42:51 node1 corosync[17541]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] The network interface [10.0.0.253] is now up.
> May 21 23:42:52 node1 corosync[17541]: [TOTEM ] Incrementing problem counter for seqid 1 iface 10.0.0.253 to [1 of 10]
> May 21 23:42:52 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 4: memb=0, new=0, lost=0
> May 21 23:42:52 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 4: memb=1, new=1, lost=0
> May 21 23:42:52 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: NEW: node1 4252674240
> May 21 23:42:52 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
> May 21 23:42:52 node1 corosync[17541]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 21 23:42:52 node1 corosync[17541]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.253) r(1) ip(10.0.0.253) ; members(old:0 left:0)
> May 21 23:42:52 node1 corosync[17541]: [MAIN ] Completed service synchronization, ready to provide service.
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 8: memb=1, new=0, lost=0
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: memb: node1 4252674240
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 8: memb=2, new=1, lost=0
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: Creating entry for node 2608507072 born on 8
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: Node 2608507072/unknown is now: member
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 2608507072
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 2608507072
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: send_member_notification: Sending membership update 8 to 0 children
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c4640 Node 4252674240 ((null)) born on: 8
> May 21 23:42:54 node1 corosync[17541]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c8b50 Node 2608507072 (node2) born on: 8
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c8b50 Node 2608507072 now known as node2 (was: (null))
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: Node node2 now has 1 quorum votes (was 0)
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: send_member_notification: Sending membership update 8 to 0 children
> May 21 23:42:54 node1 corosync[17541]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.155) r(1) ip(10.0.0.155) ; members(old:1 left:0)
> May 21 23:42:54 node1 corosync[17541]: [MAIN ] Completed service synchronization, ready to provide service.
> May 21 23:42:54 node1 corosync[17541]: [TOTEM ] ring 1 active with no faults
> May 21 23:42:56 node1 pacemakerd[17587]: notice: main: Starting Pacemaker 1.1.8-7.wc1.el6 (Build: 394e906): generated-manpages agent-manpages ascii-docs ncurses libqb-logging libqb-ipc lha-fencing corosync-plugin cman
> May 21 23:42:56 node1 pacemakerd[17587]: notice: get_local_node_name: Defaulting to uname(2).nodename for the local classic openais (with plugin) node name
> May 21 23:42:56 node1 pacemakerd[17587]: notice: update_node_processes: 0x1ec7cc0 Node 4252674240 now known as node1, was:
> May 21 23:42:56 node1 stonith-ng[17594]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> May 21 23:42:56 node1 cib[17593]: notice: main: Using new config location: /var/lib/pacemaker/cib
> May 21 23:42:56 node1 cib[17593]: warning: retrieveCib: Cluster configuration not found: /var/lib/pacemaker/cib/cib.xml
> May 21 23:42:56 node1 cib[17593]: warning: readCibXmlFile: Primary configuration corrupt or unusable, trying backup...
> May 21 23:42:56 node1 cib[17593]: warning: readCibXmlFile: Continuing with an empty configuration.
> May 21 23:42:56 node1 attrd[17596]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12d6560 for stonith-ng/0
> May 21 23:42:56 node1 crmd[17598]: notice: main: CRM Git Version: 394e906
> May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12da8e0 for attrd/0
> May 21 23:42:56 node1 attrd[17596]: notice: main: Starting mainloop...
> May 21 23:42:56 node1 cib[17593]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12df4c0 for cib/0
> May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Sending membership update 8 to cib
> May 21 23:42:56 node1 cib[17593]: notice: ais_dispatch_message: Membership 8: quorum acquired
> May 21 23:42:56 node1 cib[17593]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now member
> May 21 23:42:56 node1 cib[17593]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now member
> May 21 23:42:57 node1 pacemakerd[17587]: notice: update_node_processes: 0x1ec8c20 Node 2608507072 now known as node2, was:
> May 21 23:42:57 node1 stonith-ng[17594]: notice: setup_cib: Watching for stonith topology changes
> May 21 23:42:57 node1 crmd[17598]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> May 21 23:42:57 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12e4340 for crmd/0
> May 21 23:42:57 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Sending membership update 8 to crmd
> May 21 23:42:57 node1 crmd[17598]: notice: ais_dispatch_message: Membership 8: quorum acquired
> May 21 23:42:57 node1 crmd[17598]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now member
> May 21 23:42:57 node1 crmd[17598]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now member
> May 21 23:42:57 node1 crmd[17598]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: node1[-42293056]
> May 21 23:42:57 node1 crmd[17598]: notice: do_started: The local CRM is operational
> May 21 23:42:58 node1 cib[17593]: warning: cib_process_diff: Diff 0.0.0 -> 0.0.1 from node2 not applied to 0.0.1: current "num_updates" is greater than required
> May 21 23:42:58 node1 crmd[17598]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: node2[-1686460224]
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 12: memb=2, new=0, lost=0
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: memb: node1 4252674240
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 12: memb=3, new=1, lost=0
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: update_member: Creating entry for node 2474289344 born on 12
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: update_member: Node 2474289344/unknown is now: member
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 2474289344
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 2474289344
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: send_member_notification: Sending membership update 12 to 2 children
> May 21 23:43:01 node1 corosync[17541]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 21 23:43:01 node1 crmd[17598]: notice: crm_update_peer_state: crm_update_ais_node: Node (null)[2474289344] - state is now member
> May 21 23:43:01 node1 cib[17593]: notice: crm_update_peer_state: crm_update_ais_node: Node (null)[2474289344] - state is now member
> May 21 23:43:18 node1 crmd[17598]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> May 21 23:45:18 node1 crmd[17598]: error: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_ELECTION! (120000ms)
> May 21 23:45:18 node1 crmd[17598]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: Diff: --- 0.0.3
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: Diff: +++ 0.1.1 2d896a081763a7b2ca54f82877fbbc84
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: -- <cib admin_epoch="0" epoch="0" num_updates="3" />
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <cluster_property_set id="cib-bootstrap-options" >
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.8-7.wc1.el6-394e906" />
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ </cluster_property_set>
> May 21 23:45:18 node1 cib[17593]: notice: log_cib_diff: cib:diff: Local-only Change: 0.2.1
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: -- <cib admin_epoch="0" epoch="1" num_updates="1" />
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="classic openais (with plugin)" />
> May 21 23:45:18 node1 cib[17593]: notice: log_cib_diff: cib:diff: Local-only Change: 0.3.1
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: -- <cib admin_epoch="0" epoch="2" num_updates="1" />
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2" />
> May 21 23:46:33 node1 pacemakerd[17587]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
> May 21 23:46:33 node1 pacemakerd[17587]: notice: stop_child: Stopping crmd: Sent -15 to process 17598
> May 21 23:46:33 node1 crmd[17598]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
> May 21 23:47:18 node1 crmd[17598]: error: do_log: FSA: Input I_ERROR from feature_update_callback() received in state S_INTEGRATION
> May 21 23:47:18 node1 crmd[17598]: warning: do_state_transition: State transition S_INTEGRATION -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=feature_update_callback ]
> May 21 23:47:18 node1 crmd[17598]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
> May 21 23:47:18 node1 crmd[17598]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
> May 21 23:47:18 node1 crmd[17598]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> May 21 23:47:18 node1 crmd[17598]: notice: terminate_cs_connection: Disconnecting from Corosync
> May 21 23:47:18 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x12e4340, async-conn=0x12e4340) left
> May 21 23:47:18 node1 crmd[17598]: error: do_exit: Could not recover from internal error
> May 21 23:47:18 node1 pacemakerd[17587]: error: pcmk_child_exit: Child process crmd exited (pid=17598, rc=2)
> May 21 23:47:33 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:47:33 node1 pacemakerd[17587]: notice: stop_child: Stopping pengine: Sent -15 to process 17597
> May 21 23:47:48 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:47:48 node1 pacemakerd[17587]: notice: stop_child: Stopping attrd: Sent -15 to process 17596
> May 21 23:47:48 node1 attrd[17596]: notice: main: Exiting...
> May 21 23:47:48 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client attrd (conn=0x12da8e0, async-conn=0x12da8e0) left
> May 21 23:48:03 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:48:03 node1 pacemakerd[17587]: notice: stop_child: Stopping lrmd: Sent -15 to process 17595
> May 21 23:48:18 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:48:18 node1 pacemakerd[17587]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 17594
> May 21 23:48:18 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client stonith-ng (conn=0x12d6560, async-conn=0x12d6560) left
> May 21 23:48:33 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:48:33 node1 pacemakerd[17587]: notice: stop_child: Stopping cib: Sent -15 to process 17593
> May 21 23:48:38 node1 cib[17593]: notice: cib_force_exit: Forcing exit!
> May 21 23:48:38 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client cib (conn=0x12df4c0, async-conn=0x12df4c0) left
> May 21 23:48:38 node1 pacemakerd[17587]: error: pcmk_child_exit: Child process cib exited (pid=17593, rc=64)
> May 21 23:48:53 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:48:53 node1 pacemakerd[17587]: notice: pcmk_shutdown_worker: Shutdown complete
>
> At this point I tried to shut down corosync as described above but it
> failed to shut down and just hung/spun in shutdown.

It looks to me like pacemaker shut down...
Is the other node in this cluster running the same version of pacemaker?

> Any ideas what went wrong here? It seems to have started with
>
> May 21 23:42:58 node1 crmd[17598]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: node2[-1686460224]
>
> which led to
>
> May 21 23:45:18 node1 crmd[17598]: error: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_ELECTION! (120000ms)
>
> and
>
> May 21 23:47:18 node1 crmd[17598]: error: do_log: FSA: Input I_ERROR from feature_update_callback() received in state S_INTEGRATION

This is not a happy cluster.

Are these real machines or VMs? Firewall config ok?

Also, 1.1.8-7 was not tested with the plugin _at_all_ (and neither will future RHEL builds be). I highly recommend moving to cman for RHEL6.

Check out the quickstart: http://clusterlabs.org/quickstart-redhat.html

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
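[Editor's note] For context on the plugin warnings in the logs above: the "Found '1' for option: ver" and "Enabling MCP mode" messages indicate that corosync.conf contains a service stanza along the following lines. This is a sketch reconstructed from the log output, not the poster's actual file:

```
# /etc/corosync/corosync.conf -- the deprecated plugin-based setup
# (sketch; the actual configuration was not posted)
service {
    # Load the Pacemaker plugin into the corosync process
    name: pacemaker
    # ver: 1 means corosync loads only the plugin; pacemakerd and its
    # child daemons must be started separately via the Pacemaker
    # init script ("Enabling MCP mode" in the logs)
    ver: 1
}
```

This is the configuration style that the `process_ais_conf` ERROR messages warn against; under a CMAN-based cluster this stanza is removed entirely.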
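[Editor's note] The CMAN migration recommended above boils down to generating a /etc/cluster/cluster.conf with `ccs` and redirecting CMAN's fencing to Pacemaker via fence_pcmk. A rough sketch based on the quickstart, using the two node names from the logs; the cluster name is illustrative, and the quickstart page remains the authoritative reference:

```shell
# Install the CMAN-based stack (no corosync plugin involved)
yum install -y pacemaker cman ccs resource-agents

# Create cluster.conf and register both nodes
ccs -f /etc/cluster/cluster.conf --createcluster mycluster
ccs -f /etc/cluster/cluster.conf --addnode node1
ccs -f /etc/cluster/cluster.conf --addnode node2

# Hand all fencing decisions over to Pacemaker
ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node1
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node2
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node1 pcmk-redirect port=node1
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node2 pcmk-redirect port=node2

# Two-node cluster: don't block startup waiting for quorum
echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman

# Start CMAN first, then Pacemaker (on both nodes)
service cman start
service pacemaker start
```

Copy the resulting cluster.conf to the second node before starting the services there.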