Hi! I ran into a situation in which one node was periodically fenced when the network was busy. The node being fenced tried to restart crmd after some problem, and shortly after rejoining the cluster, it was fenced again.
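For context, our totem settings may matter here, since fencing under network load is often sensitive to the corosync token timeout. A sketch of the relevant corosync.conf section (the values below are illustrative assumptions, not our actual configuration):

```
totem {
    version: 2

    # Time (in ms) corosync waits for the token before declaring it lost;
    # larger values make membership more tolerant of network congestion.
    # (Illustrative value; the common default is 1000 ms.)
    token: 5000

    # How many retransmit attempts are made before the token is
    # considered lost. (Illustrative value.)
    token_retransmits_before_loss_const: 10
}
```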
The messages look like this:
---
Apr 5 14:14:14 h01 stonith-ng: [13076]: info: crm_new_peer: Node h05 now has id: 84939948
Apr 5 14:14:14 h01 stonith-ng: [13076]: info: crm_new_peer: Node 84939948 is now known as h05
Apr 5 14:14:14 h01 stonith-ng: [13076]: notice: remote_op_done: Operation st_fence of h01 by h05 for h05[c1ed07ad-25b2-4ea0-a168-0b667ec0dded]: OK
Apr 5 14:14:14 h01 crmd: [13080]: ERROR: tengine_stonith_notify: We were alegedly just fenced by h05 for h05!
Apr 5 14:14:14 h01 crmd: [13080]: ERROR: do_log: FSA: Input I_ERROR from tengine_stonith_notify() received in state S_NOT_DC
Apr 5 14:14:14 h01 crmd: [13080]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=tengine_stonith_notify ]
Apr 5 14:14:14 h01 crmd: [13080]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
Apr 5 14:14:14 h01 crmd: [13080]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Apr 5 14:14:14 h01 crmd: [13080]: notice: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Apr 5 14:14:14 h01 crmd: [13080]: info: do_shutdown: Disconnecting STONITH...
Apr 5 14:14:14 h01 crmd: [13080]: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Apr 5 14:14:14 h01 crmd: [13080]: info: lrm_connection_destroy: LRM Connection disconnected
Apr 5 14:14:14 h01 crmd: [13080]: info: do_lrm_control: Disconnected from the LRM
Apr 5 14:14:14 h01 crmd: [13080]: notice: terminate_ais_connection: Disconnecting from Corosync
Apr 5 14:14:14 h01 crmd: [13080]: info: do_ha_control: Disconnected from OpenAIS
Apr 5 14:14:14 h01 crmd: [13080]: info: do_cib_control: Disconnecting CIB
Apr 5 14:14:14 h01 crmd: [13080]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Apr 5 14:14:14 h01 crmd: [13080]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Apr 5 14:14:14 h01 crmd: [13080]: ERROR: do_exit: Could not recover from internal error
Apr 5 14:14:14 h01 crmd: [13080]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Apr 5 14:14:14 h01 crmd: [13080]: info: crm_xml_cleanup: Cleaning up memory from libxml2
Apr 5 14:14:14 h01 crmd: [13080]: info: do_exit: [crmd] stopped (2)
Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x7322c0, async-conn=0x7322c0) left
Apr 5 14:14:14 h01 cib: [13075]: WARN: send_ipc_message: IPC Channel to 13080 is not connected
Apr 5 14:14:14 h01 cib: [13075]: WARN: send_via_callback_channel: Delivery of reply to client 13080/bc52dbfb-ae66-47fe-a85a-32b61a57fdf5 failed
Apr 5 14:14:14 h01 cib: [13075]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=13080, rc=2)
Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: update_member: Node h01 now has process list: 00000000000000000000000000151112 (1380626)
Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] notice: pcmk_wait_dispatch: Respawning failed child process: crmd
Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: spawn_child: Forked child 16477 for process crmd
Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: update_member: Node h01 now has process list: 00000000000000000000000000151312 (1381138)
Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: update_member: Node h01 now has process list: 00000000000000000000000000151112 (1380626)
Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: update_member: Node h01 now has process list: 00000000000000000000000000151312 (1381138)
Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: send_member_notification: Sending membership update 1572 to 1 children
Apr 5 14:14:14 h01 cib: [13075]: info: ais_dispatch_message: Membership 1572: quorum retained
Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
Apr 5 14:14:14 h01 crmd: [16477]: info: Invoked: /usr/lib64/pacemaker/crmd
Apr 5 14:14:14 h01 crmd: [16477]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Apr 5 14:14:14 h01 crmd: [16477]: notice: main: CRM Git Version: 77eeb099a504ceda05d648ed161ef8b1582c7daf
Apr 5 14:14:14 h01 crmd: [16477]: info: do_cib_control: CIB connection established
Apr 5 14:14:14 h01 crmd: [16477]: info: get_cluster_type: Cluster type is: 'openais'
Apr 5 14:14:14 h01 crmd: [16477]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
Apr 5 14:14:14 h01 crmd: [16477]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
Apr 5 14:14:14 h01 crmd: [16477]: info: init_ais_connection_classic: AIS connection established
Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: pcmk_ipc: Recorded connection 0x7f48c8000e90 for crmd/16477
Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: pcmk_ipc: Sending membership update 1572 to crmd
Apr 5 14:14:14 h01 crmd: [16477]: info: get_ais_nodeid: Server details: id=17831084 uname=h01 cname=pcmk
Apr 5 14:14:14 h01 crmd: [16477]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
Apr 5 14:14:14 h01 crmd: [16477]: info: crm_new_peer: Node h01 now has id: 17831084
Apr 5 14:14:14 h01 crmd: [16477]: info: crm_new_peer: Node 17831084 is now known as h01
Apr 5 14:14:14 h01 crmd: [16477]: info: ais_status_callback: status: h01 is now unknown
Apr 5 14:14:14 h01 crmd: [16477]: info: do_ha_control: Connected to the cluster
Apr 5 14:14:14 h01 crmd: [16477]: info: do_started: Delaying start, no membership data (0000000000100000)
Apr 5 14:14:14 h01 crmd: [16477]: notice: ais_dispatch_message: Membership 1572: quorum acquired
Apr 5 14:14:14 h01 crmd: [16477]: info: crm_new_peer: Node h05 now has id: 84939948
Apr 5 14:14:14 h01 crmd: [16477]: info: crm_new_peer: Node 84939948 is now known as h05
Apr 5 14:14:14 h01 crmd: [16477]: info: ais_status_callback: status: h05 is now unknown
Apr 5 14:14:14 h01 crmd: [16477]: info: ais_status_callback: status: h05 is now member (was unknown)
Apr 5 14:14:14 h01 crmd: [16477]: info: crm_update_peer: Node h05: id=84939948 state=member (new) addr=r(0) ip(172.20.16.5) r(1) ip(10.2.2.5) votes=1 born=1348 seen=1572 proc=00000000000000000000000000151312
Apr 5 14:14:14 h01 crmd: [16477]: notice: crmd_peer_update: Status update: Client h01/crmd now has status [online] (DC=<null>)
Apr 5 14:14:14 h01 crmd: [16477]: info: ais_status_callback: status: h01 is now member (was unknown)
Apr 5 14:14:14 h01 crmd: [16477]: info: crm_update_peer: Node h01: id=17831084 state=member (new) addr=r(0) ip(172.20.16.1) r(1) ip(10.2.2.1) (new) votes=1 (new) born=1572 seen=1572 proc=00000000000000000000000000151312 (new)
---
Anybody else?

Regards,
Ulrich

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
