On 22/05/2013, at 9:44 PM, Brian J. Murrell <br...@interlinx.bc.ca> wrote:
> Using pacemaker 1.1.8-7 on EL6, I got the following series of events
> trying to shut down pacemaker and then corosync. The corosync shutdown
> (service corosync stop) ended up spinning/hanging indefinitely (~7hrs
> now). The events, including a:
>
> May 21 23:47:18 node1 crmd[17598]: error: do_exit: Could not recover from internal error
>
> For completeness, I've included the logging from the whole session,
> from corosync startup until it all went haywire. The badness seems to
> start at about 23:43:18.
>
> May 21 23:42:51 node1 corosync[17541]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
> May 21 23:42:51 node1 corosync[17541]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
> May 21 23:42:51 node1 corosync[17541]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transport (UDP/IP Multicast).
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transport (UDP/IP Multicast).
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] The network interface [192.168.122.253] is now up.
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: process_ais_conf: Reading configure
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN

Can I take a brief moment to highlight those last two lines...

And to also refer you to http://blog.clusterlabs.org/blog/2013/pacemaker-on-rhel6-dot-4/ which talks about issues with fencing for plugin-based clusters.

> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_init: Local handle: 5650605097994944513 for logging
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_next: Processing additional logging options...
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found 'off' for option: debug
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found 'no' for option: to_logfile
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_init: Local handle: 2730409743423111170 for quorum
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_next: No additional configuration supplied for: quorum
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: No default for option: provider
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_init: Local handle: 5880381755227111427 for service
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: config_find_next: Processing additional service options...
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Found '1' for option: ver
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: process_ais_conf: Enabling MCP mode: Use the Pacemaker init script to complete Pacemaker startup
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_startup: CRM: Initialized
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] Logging: Initialized pcmk_startup
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_startup: Service: 9
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_startup: Local hostname: node1
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: pcmk_update_nodeid: Local node id: 4252674240
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: Creating entry for node 4252674240 born on 0
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c4640 Node 4252674240 now known as node1 (was: (null))
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: Node node1 now has 1 quorum votes (was 0)
> May 21 23:42:51 node1 corosync[17541]: [pcmk ] info: update_member: Node 4252674240/node1 is now: member
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.8
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync configuration service
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync profile loading service
> May 21 23:42:51 node1 corosync[17541]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> May 21 23:42:51 node1 corosync[17541]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> May 21 23:42:51 node1 corosync[17541]: [TOTEM ] The network interface [10.0.0.253] is now up.
> May 21 23:42:52 node1 corosync[17541]: [TOTEM ] Incrementing problem counter for seqid 1 iface 10.0.0.253 to [1 of 10]
> May 21 23:42:52 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 4: memb=0, new=0, lost=0
> May 21 23:42:52 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 4: memb=1, new=1, lost=0
> May 21 23:42:52 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: NEW: node1 4252674240
> May 21 23:42:52 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
> May 21 23:42:52 node1 corosync[17541]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 21 23:42:52 node1 corosync[17541]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.253) r(1) ip(10.0.0.253) ; members(old:0 left:0)
> May 21 23:42:52 node1 corosync[17541]: [MAIN ] Completed service synchronization, ready to provide service.
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 8: memb=1, new=0, lost=0
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: memb: node1 4252674240
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 8: memb=2, new=1, lost=0
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: Creating entry for node 2608507072 born on 8
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: Node 2608507072/unknown is now: member
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 2608507072
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 2608507072
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: send_member_notification: Sending membership update 8 to 0 children
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c4640 Node 4252674240 ((null)) born on: 8
> May 21 23:42:54 node1 corosync[17541]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c8b50 Node 2608507072 (node2) born on: 8
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: 0x12c8b50 Node 2608507072 now known as node2 (was: (null))
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: update_member: Node node2 now has 1 quorum votes (was 0)
> May 21 23:42:54 node1 corosync[17541]: [pcmk ] info: send_member_notification: Sending membership update 8 to 0 children
> May 21 23:42:54 node1 corosync[17541]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.155) r(1) ip(10.0.0.155) ; members(old:1 left:0)
> May 21 23:42:54 node1 corosync[17541]: [MAIN ] Completed service synchronization, ready to provide service.
> May 21 23:42:54 node1 corosync[17541]: [TOTEM ] ring 1 active with no faults
> May 21 23:42:56 node1 pacemakerd[17587]: notice: main: Starting Pacemaker 1.1.8-7.wc1.el6 (Build: 394e906): generated-manpages agent-manpages ascii-docs ncurses libqb-logging libqb-ipc lha-fencing corosync-plugin cman
> May 21 23:42:56 node1 pacemakerd[17587]: notice: get_local_node_name: Defaulting to uname(2).nodename for the local classic openais (with plugin) node name
> May 21 23:42:56 node1 pacemakerd[17587]: notice: update_node_processes: 0x1ec7cc0 Node 4252674240 now known as node1, was:
> May 21 23:42:56 node1 stonith-ng[17594]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> May 21 23:42:56 node1 cib[17593]: notice: main: Using new config location: /var/lib/pacemaker/cib
> May 21 23:42:56 node1 cib[17593]: warning: retrieveCib: Cluster configuration not found: /var/lib/pacemaker/cib/cib.xml
> May 21 23:42:56 node1 cib[17593]: warning: readCibXmlFile: Primary configuration corrupt or unusable, trying backup...
> May 21 23:42:56 node1 cib[17593]: warning: readCibXmlFile: Continuing with an empty configuration.
> May 21 23:42:56 node1 attrd[17596]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12d6560 for stonith-ng/0
> May 21 23:42:56 node1 crmd[17598]: notice: main: CRM Git Version: 394e906
> May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12da8e0 for attrd/0
> May 21 23:42:56 node1 attrd[17596]: notice: main: Starting mainloop...
> May 21 23:42:56 node1 cib[17593]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12df4c0 for cib/0
> May 21 23:42:56 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Sending membership update 8 to cib
> May 21 23:42:56 node1 cib[17593]: notice: ais_dispatch_message: Membership 8: quorum acquired
> May 21 23:42:56 node1 cib[17593]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now member
> May 21 23:42:56 node1 cib[17593]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now member
> May 21 23:42:57 node1 pacemakerd[17587]: notice: update_node_processes: 0x1ec8c20 Node 2608507072 now known as node2, was:
> May 21 23:42:57 node1 stonith-ng[17594]: notice: setup_cib: Watching for stonith topology changes
> May 21 23:42:57 node1 crmd[17598]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> May 21 23:42:57 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Recorded connection 0x12e4340 for crmd/0
> May 21 23:42:57 node1 corosync[17541]: [pcmk ] info: pcmk_ipc: Sending membership update 8 to crmd
> May 21 23:42:57 node1 crmd[17598]: notice: ais_dispatch_message: Membership 8: quorum acquired
> May 21 23:42:57 node1 crmd[17598]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now member
> May 21 23:42:57 node1 crmd[17598]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now member
> May 21 23:42:57 node1 crmd[17598]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: node1[-42293056]
> May 21 23:42:57 node1 crmd[17598]: notice: do_started: The local CRM is operational
> May 21 23:42:58 node1 cib[17593]: warning: cib_process_diff: Diff 0.0.0 -> 0.0.1 from node2 not applied to 0.0.1: current "num_updates" is greater than required
> May 21 23:42:58 node1 crmd[17598]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: node2[-1686460224]
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 12: memb=2, new=0, lost=0
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: memb: node1 4252674240
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 12: memb=3, new=1, lost=0
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: update_member: Creating entry for node 2474289344 born on 12
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: update_member: Node 2474289344/unknown is now: member
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 2474289344
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 2474289344
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
> May 21 23:43:01 node1 corosync[17541]: [pcmk ] info: send_member_notification: Sending membership update 12 to 2 children
> May 21 23:43:01 node1 corosync[17541]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 21 23:43:01 node1 crmd[17598]: notice: crm_update_peer_state: crm_update_ais_node: Node (null)[2474289344] - state is now member
> May 21 23:43:01 node1 cib[17593]: notice: crm_update_peer_state: crm_update_ais_node: Node (null)[2474289344] - state is now member
> May 21 23:43:18 node1 crmd[17598]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> May 21 23:45:18 node1 crmd[17598]: error: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_ELECTION! (120000ms)
> May 21 23:45:18 node1 crmd[17598]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: Diff: --- 0.0.3
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: Diff: +++ 0.1.1 2d896a081763a7b2ca54f82877fbbc84
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: -- <cib admin_epoch="0" epoch="0" num_updates="3" />
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <cluster_property_set id="cib-bootstrap-options" >
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.8-7.wc1.el6-394e906" />
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ </cluster_property_set>
> May 21 23:45:18 node1 cib[17593]: notice: log_cib_diff: cib:diff: Local-only Change: 0.2.1
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: -- <cib admin_epoch="0" epoch="1" num_updates="1" />
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="classic openais (with plugin)" />
> May 21 23:45:18 node1 cib[17593]: notice: log_cib_diff: cib:diff: Local-only Change: 0.3.1
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: -- <cib admin_epoch="0" epoch="2" num_updates="1" />
> May 21 23:45:18 node1 cib[17593]: notice: cib:diff: ++ <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2" />
> May 21 23:46:33 node1 pacemakerd[17587]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
> May 21 23:46:33 node1 pacemakerd[17587]: notice: stop_child: Stopping crmd: Sent -15 to process 17598
> May 21 23:46:33 node1 crmd[17598]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
> May 21 23:47:18 node1 crmd[17598]: error: do_log: FSA: Input I_ERROR from feature_update_callback() received in state S_INTEGRATION
> May 21 23:47:18 node1 crmd[17598]: warning: do_state_transition: State transition S_INTEGRATION -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=feature_update_callback ]
> May 21 23:47:18 node1 crmd[17598]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
> May 21 23:47:18 node1 crmd[17598]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
> May 21 23:47:18 node1 crmd[17598]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> May 21 23:47:18 node1 crmd[17598]: notice: terminate_cs_connection: Disconnecting from Corosync
> May 21 23:47:18 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x12e4340, async-conn=0x12e4340) left
> May 21 23:47:18 node1 crmd[17598]: error: do_exit: Could not recover from internal error
> May 21 23:47:18 node1 pacemakerd[17587]: error: pcmk_child_exit: Child process crmd exited (pid=17598, rc=2)
> May 21 23:47:33 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:47:33 node1 pacemakerd[17587]: notice: stop_child: Stopping pengine: Sent -15 to process 17597
> May 21 23:47:48 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:47:48 node1 pacemakerd[17587]: notice: stop_child: Stopping attrd: Sent -15 to process 17596
> May 21 23:47:48 node1 attrd[17596]: notice: main: Exiting...
> May 21 23:47:48 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client attrd (conn=0x12da8e0, async-conn=0x12da8e0) left
> May 21 23:48:03 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:48:03 node1 pacemakerd[17587]: notice: stop_child: Stopping lrmd: Sent -15 to process 17595
> May 21 23:48:18 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:48:18 node1 pacemakerd[17587]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 17594
> May 21 23:48:18 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client stonith-ng (conn=0x12d6560, async-conn=0x12d6560) left
> May 21 23:48:33 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:48:33 node1 pacemakerd[17587]: notice: stop_child: Stopping cib: Sent -15 to process 17593
> May 21 23:48:38 node1 cib[17593]: notice: cib_force_exit: Forcing exit!
> May 21 23:48:38 node1 corosync[17541]: [pcmk ] info: pcmk_ipc_exit: Client cib (conn=0x12df4c0, async-conn=0x12df4c0) left
> May 21 23:48:38 node1 pacemakerd[17587]: error: pcmk_child_exit: Child process cib exited (pid=17593, rc=64)
> May 21 23:48:53 node1 pacemakerd[17587]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
> May 21 23:48:53 node1 pacemakerd[17587]: notice: pcmk_shutdown_worker: Shutdown complete
>
> At this point I tried to shut down corosync as described above but it
> failed to shut down and just hung/spun in shutdown.

It looks to me like pacemaker shut down...
Is the other node in this cluster running the same version of pacemaker?

> Any ideas what went wrong here? It seems to have started with
>
> May 21 23:42:58 node1 crmd[17598]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: node2[-1686460224]
>
> which led to
>
> May 21 23:45:18 node1 crmd[17598]: error: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped in state S_ELECTION! (120000ms)
>
> and
>
> May 21 23:47:18 node1 crmd[17598]: error: do_log: FSA: Input I_ERROR from feature_update_callback() received in state S_INTEGRATION

This is not a happy cluster.

Are these real machines or VMs? Firewall config ok?

Also, 1.1.8-7 was not tested with the plugin _at_all_ (and neither will future RHEL builds be). I highly recommend moving to cman for RHEL6.

Check out the quickstart: http://clusterlabs.org/quickstart-redhat.html

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
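[Editor's note] For context on the plugin warnings in the logs above: the "Found '1' for option: ver" and "Enabling MCP mode" messages indicate that corosync.conf contains a service stanza along the following lines. This is a sketch reconstructed from the log output, not the poster's actual file:

```
# /etc/corosync/corosync.conf -- the deprecated plugin-based setup
# (sketch; the actual configuration was not posted)
service {
    # Load the Pacemaker plugin into the corosync process
    name: pacemaker
    # ver: 1 means corosync loads only the plugin; pacemakerd and its
    # child daemons must be started separately via the Pacemaker
    # init script ("Enabling MCP mode" in the logs)
    ver: 1
}
```

This is the configuration style that the `process_ais_conf` ERROR messages warn against; under a CMAN-based cluster this stanza is removed entirely.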
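[Editor's note] The CMAN migration recommended above boils down to generating a /etc/cluster/cluster.conf with `ccs` and redirecting CMAN's fencing to Pacemaker via fence_pcmk. A rough sketch based on the quickstart, using the two node names from the logs; the cluster name is illustrative, and the quickstart page remains the authoritative reference:

```shell
# Install the CMAN-based stack (no corosync plugin involved)
yum install -y pacemaker cman ccs resource-agents

# Create cluster.conf and register both nodes
ccs -f /etc/cluster/cluster.conf --createcluster mycluster
ccs -f /etc/cluster/cluster.conf --addnode node1
ccs -f /etc/cluster/cluster.conf --addnode node2

# Hand all fencing decisions over to Pacemaker
ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node1
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node2
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node1 pcmk-redirect port=node1
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node2 pcmk-redirect port=node2

# Two-node cluster: don't block startup waiting for quorum
echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman

# Start CMAN first, then Pacemaker (on both nodes)
service cman start
service pacemaker start
```

Copy the resulting cluster.conf to the second node before starting the services there.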