Hi!

I had a situation where one node was periodically fenced whenever the
network was busy. The node being fenced tried to restart crmd after some
problem, and shortly after rejoining the cluster, it was fenced.
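In case it is relevant: since the problem only shows up under network load, my guess (an assumption, not something I have confirmed) is that corosync token loss during congestion is what triggers the fencing in the first place. The relevant knobs live in the totem section of corosync.conf; a sketch with illustrative values, not our production settings:

```
# /etc/corosync/corosync.conf -- totem section (illustrative values)
totem {
    version: 2
    # token: how long (ms) to wait for the token before declaring a
    # node lost; raising this tolerates congestion at the cost of
    # slower failure detection
    token: 5000
    # retransmits allowed before the token is considered lost
    token_retransmits_before_loss_const: 10
    # consensus should be at least 1.2 * token
    consensus: 6000
}
```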

The messages look like this:
---
Apr  5 14:14:14 h01 stonith-ng: [13076]: info: crm_new_peer: Node h05 now has 
id: 84939948
Apr  5 14:14:14 h01 stonith-ng: [13076]: info: crm_new_peer: Node 84939948 is 
now known as h05
Apr  5 14:14:14 h01 stonith-ng: [13076]: notice: remote_op_done: Operation 
st_fence of h01 by h05 for h05[c1ed07ad-25b2-4ea0-a168-0b667ec0dded]: OK
Apr  5 14:14:14 h01 crmd: [13080]: ERROR: tengine_stonith_notify: We were 
alegedly just fenced by h05 for h05!
Apr  5 14:14:14 h01 crmd: [13080]: ERROR: do_log: FSA: Input I_ERROR from 
tengine_stonith_notify() received in state S_NOT_DC
Apr  5 14:14:14 h01 crmd: [13080]: notice: do_state_transition: State 
transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL 
origin=tengine_stonith_notify ]
Apr  5 14:14:14 h01 crmd: [13080]: ERROR: do_recover: Action A_RECOVER 
(0000000001000000) not supported
Apr  5 14:14:14 h01 crmd: [13080]: ERROR: do_log: FSA: Input I_TERMINATE from 
do_recover() received in state S_RECOVERY
Apr  5 14:14:14 h01 crmd: [13080]: notice: do_state_transition: State 
transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL 
origin=do_recover ]
Apr  5 14:14:14 h01 crmd: [13080]: info: do_shutdown: Disconnecting STONITH...
Apr  5 14:14:14 h01 crmd: [13080]: info: tengine_stonith_connection_destroy: 
Fencing daemon disconnected
Apr  5 14:14:14 h01 crmd: [13080]: info: lrm_connection_destroy: LRM Connection 
disconnected
Apr  5 14:14:14 h01 crmd: [13080]: info: do_lrm_control: Disconnected from the 
LRM
Apr  5 14:14:14 h01 crmd: [13080]: notice: terminate_ais_connection: 
Disconnecting from Corosync
Apr  5 14:14:14 h01 crmd: [13080]: info: do_ha_control: Disconnected from 
OpenAIS
Apr  5 14:14:14 h01 crmd: [13080]: info: do_cib_control: Disconnecting CIB
Apr  5 14:14:14 h01 crmd: [13080]: info: crmd_cib_connection_destroy: 
Connection to the CIB terminated...
Apr  5 14:14:14 h01 crmd: [13080]: info: do_exit: Performing A_EXIT_0 - 
gracefully exiting the CRMd
Apr  5 14:14:14 h01 crmd: [13080]: ERROR: do_exit: Could not recover from 
internal error
Apr  5 14:14:14 h01 crmd: [13080]: info: free_mem: Dropping I_TERMINATE: [ 
state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Apr  5 14:14:14 h01 crmd: [13080]: info: crm_xml_cleanup: Cleaning up memory 
from libxml2
Apr  5 14:14:14 h01 crmd: [13080]: info: do_exit: [crmd] stopped (2)
Apr  5 14:14:14 h01 corosync[13037]:  [pcmk  ] info: pcmk_ipc_exit: Client crmd 
(conn=0x7322c0, async-conn=0x7322c0) left
Apr  5 14:14:14 h01 cib: [13075]: WARN: send_ipc_message: IPC Channel to 13080 
is not connected
Apr  5 14:14:14 h01 cib: [13075]: WARN: send_via_callback_channel: Delivery of 
reply to client 13080/bc52dbfb-ae66-47fe-a85a-32b61a57fdf5 failed
Apr  5 14:14:14 h01 cib: [13075]: WARN: do_local_notify: A-Sync reply to crmd 
failed: reply failed
Apr  5 14:14:14 h01 corosync[13037]:  [pcmk  ] ERROR: pcmk_wait_dispatch: Child 
process crmd exited (pid=13080, rc=2)
Apr  5 14:14:14 h01 corosync[13037]:  [pcmk  ] info: update_member: Node h01 
now has process list: 00000000000000000000000000151112 (1380626)
Apr  5 14:14:14 h01 corosync[13037]:  [pcmk  ] notice: pcmk_wait_dispatch: 
Respawning failed child process: crmd
Apr  5 14:14:14 h01 corosync[13037]:  [pcmk  ] info: spawn_child: Forked child 
16477 for process crmd
Apr  5 14:14:14 h01 corosync[13037]:  [pcmk  ] info: update_member: Node h01 
now has process list: 00000000000000000000000000151312 (1381138)
Apr  5 14:14:14 h01 corosync[13037]:  [pcmk  ] info: update_member: Node h01 
now has process list: 00000000000000000000000000151112 (1380626)
Apr  5 14:14:14 h01 corosync[13037]:  [pcmk  ] info: update_member: Node h01 
now has process list: 00000000000000000000000000151312 (1381138)
Apr  5 14:14:14 h01 corosync[13037]:  [pcmk  ] info: send_member_notification: 
Sending membership update 1572 to 1 children
Apr  5 14:14:14 h01 cib: [13075]: info: ais_dispatch_message: Membership 1572: 
quorum retained
Apr  5 14:14:14 h01 corosync[13037]:  [pcmk  ] WARN: route_ais_message: Sending 
message to local.crmd failed: ipc delivery failed (rc=-2)
Apr  5 14:14:14 h01 crmd: [16477]: info: Invoked: /usr/lib64/pacemaker/crmd
Apr  5 14:14:14 h01 crmd: [16477]: info: crm_log_init_worker: Changed active 
directory to /var/lib/heartbeat/cores/hacluster
Apr  5 14:14:14 h01 crmd: [16477]: notice: main: CRM Git Version: 
77eeb099a504ceda05d648ed161ef8b1582c7daf
Apr  5 14:14:14 h01 crmd: [16477]: info: do_cib_control: CIB connection 
established
Apr  5 14:14:14 h01 crmd: [16477]: info: get_cluster_type: Cluster type is: 
'openais'
Apr  5 14:14:14 h01 crmd: [16477]: notice: crm_cluster_connect: Connecting to 
cluster infrastructure: classic openais (with plugin)
Apr  5 14:14:14 h01 crmd: [16477]: info: init_ais_connection_classic: Creating 
connection to our Corosync plugin
Apr  5 14:14:14 h01 crmd: [16477]: info: init_ais_connection_classic: AIS 
connection established
Apr  5 14:14:14 h01 corosync[13037]:  [pcmk  ] info: pcmk_ipc: Recorded 
connection 0x7f48c8000e90 for crmd/16477
Apr  5 14:14:14 h01 corosync[13037]:  [pcmk  ] info: pcmk_ipc: Sending 
membership update 1572 to crmd
Apr  5 14:14:14 h01 crmd: [16477]: info: get_ais_nodeid: Server details: 
id=17831084 uname=h01 cname=pcmk
Apr  5 14:14:14 h01 crmd: [16477]: info: init_ais_connection_once: Connection 
to 'classic openais (with plugin)': established
Apr  5 14:14:14 h01 crmd: [16477]: info: crm_new_peer: Node h01 now has id: 
17831084
Apr  5 14:14:14 h01 crmd: [16477]: info: crm_new_peer: Node 17831084 is now 
known as h01
Apr  5 14:14:14 h01 crmd: [16477]: info: ais_status_callback: status: h01 is 
now unknown
Apr  5 14:14:14 h01 crmd: [16477]: info: do_ha_control: Connected to the cluster
Apr  5 14:14:14 h01 crmd: [16477]: info: do_started: Delaying start, no 
membership data (0000000000100000)
Apr  5 14:14:14 h01 crmd: [16477]: notice: ais_dispatch_message: Membership 
1572: quorum acquired
Apr  5 14:14:14 h01 crmd: [16477]: info: crm_new_peer: Node h05 now has id: 
84939948
Apr  5 14:14:14 h01 crmd: [16477]: info: crm_new_peer: Node 84939948 is now 
known as h05
Apr  5 14:14:14 h01 crmd: [16477]: info: ais_status_callback: status: h05 is 
now unknown
Apr  5 14:14:14 h01 crmd: [16477]: info: ais_status_callback: status: h05 is 
now member (was unknown)
Apr  5 14:14:14 h01 crmd: [16477]: info: crm_update_peer: Node h05: id=84939948 
state=member (new) addr=r(0) ip(172.20.16.5) r(1) ip(10.2.2.5)  votes=1 
born=1348 seen=1572 proc=00000000000000000000000000151312
Apr  5 14:14:14 h01 crmd: [16477]: notice: crmd_peer_update: Status update: 
Client h01/crmd now has status [online] (DC=<null>)
Apr  5 14:14:14 h01 crmd: [16477]: info: ais_status_callback: status: h01 is 
now member (was unknown)
Apr  5 14:14:14 h01 crmd: [16477]: info: crm_update_peer: Node h01: id=17831084 
state=member (new) addr=r(0) ip(172.20.16.1) r(1) ip(10.2.2.1)  (new) votes=1 
(new) born=1572 seen=1572 proc=000000000000000000000000001
51312 (new)
---
Has anybody else seen this?

Regards,
Ulrich

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems