On 10/04/2013, at 11:54 PM, Ulrich Windl <[email protected]> wrote:
> Hi!
>
> I had a situation when one node was periodically fenced when there was a busy
> network. The node being fenced tried to restart crmd after some problem, and
> shortly after rejoining the cluster, it was fenced.

The message "Apr 5 14:14:14 h01 crmd: [13080]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported" is normal, but should really be changed as it is misleading. The "real" error is above it:

> Apr 5 14:14:14 h01 crmd: [13080]: ERROR: tengine_stonith_notify: We were
> alegedly just fenced by h05 for h05!

The rest is pacemaker saying "holy heck" and trying to get out of there asap (see the sketch below the quoted logs for roughly what that amounts to).

What agent are you using for fencing? Doesn't sound very reliable.

> The messages look like this:
> ---
> Apr 5 14:14:14 h01 stonith-ng: [13076]: info: crm_new_peer: Node h05 now has id: 84939948
> Apr 5 14:14:14 h01 stonith-ng: [13076]: info: crm_new_peer: Node 84939948 is now known as h05
> Apr 5 14:14:14 h01 stonith-ng: [13076]: notice: remote_op_done: Operation st_fence of h01 by h05 for h05[c1ed07ad-25b2-4ea0-a168-0b667ec0dded]: OK
> Apr 5 14:14:14 h01 crmd: [13080]: ERROR: tengine_stonith_notify: We were alegedly just fenced by h05 for h05!
> Apr 5 14:14:14 h01 crmd: [13080]: ERROR: do_log: FSA: Input I_ERROR from tengine_stonith_notify() received in state S_NOT_DC
> Apr 5 14:14:14 h01 crmd: [13080]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=tengine_stonith_notify ]
> Apr 5 14:14:14 h01 crmd: [13080]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
> Apr 5 14:14:14 h01 crmd: [13080]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> Apr 5 14:14:14 h01 crmd: [13080]: notice: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
> Apr 5 14:14:14 h01 crmd: [13080]: info: do_shutdown: Disconnecting STONITH...
> Apr 5 14:14:14 h01 crmd: [13080]: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
> Apr 5 14:14:14 h01 crmd: [13080]: info: lrm_connection_destroy: LRM Connection disconnected
> Apr 5 14:14:14 h01 crmd: [13080]: info: do_lrm_control: Disconnected from the LRM
> Apr 5 14:14:14 h01 crmd: [13080]: notice: terminate_ais_connection: Disconnecting from Corosync
> Apr 5 14:14:14 h01 crmd: [13080]: info: do_ha_control: Disconnected from OpenAIS
> Apr 5 14:14:14 h01 crmd: [13080]: info: do_cib_control: Disconnecting CIB
> Apr 5 14:14:14 h01 crmd: [13080]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
> Apr 5 14:14:14 h01 crmd: [13080]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> Apr 5 14:14:14 h01 crmd: [13080]: ERROR: do_exit: Could not recover from internal error
> Apr 5 14:14:14 h01 crmd: [13080]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> Apr 5 14:14:14 h01 crmd: [13080]: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Apr 5 14:14:14 h01 crmd: [13080]: info: do_exit: [crmd] stopped (2)
> Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x7322c0, async-conn=0x7322c0) left
> Apr 5 14:14:14 h01 cib: [13075]: WARN: send_ipc_message: IPC Channel to 13080 is not connected
> Apr 5 14:14:14 h01 cib: [13075]: WARN: send_via_callback_channel: Delivery of reply to client 13080/bc52dbfb-ae66-47fe-a85a-32b61a57fdf5 failed
> Apr 5 14:14:14 h01 cib: [13075]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=13080, rc=2)
> Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: update_member: Node h01 now has process list: 00000000000000000000000000151112 (1380626)
> Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] notice: pcmk_wait_dispatch: Respawning failed child process: crmd
> Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: spawn_child: Forked child 16477 for process crmd
> Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: update_member: Node h01 now has process list: 00000000000000000000000000151312 (1381138)
> Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: update_member: Node h01 now has process list: 00000000000000000000000000151112 (1380626)
> Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: update_member: Node h01 now has process list: 00000000000000000000000000151312 (1381138)
> Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: send_member_notification: Sending membership update 1572 to 1 children
> Apr 5 14:14:14 h01 cib: [13075]: info: ais_dispatch_message: Membership 1572: quorum retained
> Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Apr 5 14:14:14 h01 crmd: [16477]: info: Invoked: /usr/lib64/pacemaker/crmd
> Apr 5 14:14:14 h01 crmd: [16477]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
> Apr 5 14:14:14 h01 crmd: [16477]: notice: main: CRM Git Version: 77eeb099a504ceda05d648ed161ef8b1582c7daf
> Apr 5 14:14:14 h01 crmd: [16477]: info: do_cib_control: CIB connection established
> Apr 5 14:14:14 h01 crmd: [16477]: info: get_cluster_type: Cluster type is: 'openais'
> Apr 5 14:14:14 h01 crmd: [16477]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> Apr 5 14:14:14 h01 crmd: [16477]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
> Apr 5 14:14:14 h01 crmd: [16477]: info: init_ais_connection_classic: AIS connection established
> Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: pcmk_ipc: Recorded connection 0x7f48c8000e90 for crmd/16477
> Apr 5 14:14:14 h01 corosync[13037]: [pcmk ] info: pcmk_ipc: Sending membership update 1572 to crmd
> Apr 5 14:14:14 h01 crmd: [16477]: info: get_ais_nodeid: Server details: id=17831084 uname=h01 cname=pcmk
> Apr 5 14:14:14 h01 crmd: [16477]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
> Apr 5 14:14:14 h01 crmd: [16477]: info: crm_new_peer: Node h01 now has id: 17831084
> Apr 5 14:14:14 h01 crmd: [16477]: info: crm_new_peer: Node 17831084 is now known as h01
> Apr 5 14:14:14 h01 crmd: [16477]: info: ais_status_callback: status: h01 is now unknown
> Apr 5 14:14:14 h01 crmd: [16477]: info: do_ha_control: Connected to the cluster
> Apr 5 14:14:14 h01 crmd: [16477]: info: do_started: Delaying start, no membership data (0000000000100000)
> Apr 5 14:14:14 h01 crmd: [16477]: notice: ais_dispatch_message: Membership 1572: quorum acquired
> Apr 5 14:14:14 h01 crmd: [16477]: info: crm_new_peer: Node h05 now has id: 84939948
> Apr 5 14:14:14 h01 crmd: [16477]: info: crm_new_peer: Node 84939948 is now known as h05
> Apr 5 14:14:14 h01 crmd: [16477]: info: ais_status_callback: status: h05 is now unknown
> Apr 5 14:14:14 h01 crmd: [16477]: info: ais_status_callback: status: h05 is now member (was unknown)
> Apr 5 14:14:14 h01 crmd: [16477]: info: crm_update_peer: Node h05: id=84939948 state=member (new) addr=r(0) ip(172.20.16.5) r(1) ip(10.2.2.5) votes=1 born=1348 seen=1572 proc=00000000000000000000000000151312
> Apr 5 14:14:14 h01 crmd: [16477]: notice: crmd_peer_update: Status update: Client h01/crmd now has status [online] (DC=<null>)
> Apr 5 14:14:14 h01 crmd: [16477]: info: ais_status_callback: status: h01 is now member (was unknown)
> Apr 5 14:14:14 h01 crmd: [16477]: info: crm_update_peer: Node h01: id=17831084 state=member (new) addr=r(0) ip(172.20.16.1) r(1) ip(10.2.2.1) (new) votes=1 (new) born=1572 seen=1572 proc=00000000000000000000000000151312 (new)
> ---
>
> Anybody else?
>
> Regards,
> Ulrich
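
For the curious: the "get out of there" reaction above boils down to something like the sketch below. To be clear, this is an illustrative, self-contained program, not the actual Pacemaker source; the struct and function names are invented, but the decision ("the fence target is me") and the exit status correspond to the messages in the logs.

---
/*
 * Illustrative sketch only -- NOT the actual Pacemaker source.
 * The names here are invented; the behaviour mirrors the logs above:
 * on being told that it was the target of a successful fencing
 * operation, the crmd exits and lets corosync respawn it.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct st_notification {
    const char *target;  /* node that was fenced            (h01 above) */
    const char *origin;  /* node that executed the fence    (h05 above) */
    const char *client;  /* node on whose behalf it was run (h05 above) */
};

/* Returns nonzero when the notification says the local node was fenced. */
static int fenced_ourselves(const struct st_notification *n,
                            const char *local_uname)
{
    if (strcmp(n->target, local_uname) == 0) {
        /* "alegedly" (sic) -- spelled as in the logs above */
        fprintf(stderr,
                "ERROR: We were alegedly just fenced by %s for %s!\n",
                n->origin, n->client);
        return 1;
    }
    return 0;
}

int main(void)
{
    /* Values from "Operation st_fence of h01 by h05 for h05[...]: OK" */
    struct st_notification n = { "h01", "h05", "h05" };

    if (fenced_ourselves(&n, "h01")) {
        /* The cluster believes this node is dead, so nothing it does can
         * be trusted.  This is the I_ERROR -> S_RECOVERY -> S_TERMINATE
         * cascade in the logs; the exit status matches "Child process
         * crmd exited (pid=13080, rc=2)". */
        exit(2);
    }
    return 0;
}
---

The respawn afterwards is just corosync's pcmk plugin restarting its failed child: the new crmd (pid 16477) comes up, reconnects, and rejoins the membership as if freshly started.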
