Hi,

On Fri, May 11, 2012 at 06:05:27PM +0300, Spiros Ioannou wrote:
> Hello all,
> Corosync/pacemaker with 3 nodes (vserver1, vserver2, vserver3), 28 resources
> defined, with quorum, stonith (via ipmi/ilo).
> Most of the time all these work correctly, but when we shut down servers and
> try to restart them, strange things may happen:
>
> 1) All servers could be stuck in an unclean state (unclean online itself,
> unclean offline the others, OR ok itself, unclean offline the others). The
> only way to resolve this is to shut down the servers and start them with a
> big time interval in between, several times.
>
> 2) On the last startup of the servers, two servers were up, quorum was
> true, and then, when the 3rd started to boot, one of the two issued stonith
> to the 3rd, then to the other one, and then to itself (the latter failed).

A node cannot fence itself. That's prohibited.
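In case it helps, a common way to express that in the configuration is to
keep each node's fencing device off the node it is meant to kill. A minimal
sketch in crm shell syntax, assuming an external/ipmi stonith agent and
placeholder names/credentials (adjust to your actual ipmi/ilo setup):

    # one fencing device per node, illustrative values only
    primitive st-vserver1 stonith:external/ipmi \
        params hostname=vserver1 ipaddr=<ilo-address> \
               userid=<user> passwd=<secret>
    # never run a node's fencing device on that node itself
    location l-st-vserver1 st-vserver1 -inf: vserver1

and likewise for vserver2 and vserver3. It is also worth double-checking the
no-quorum-policy, stonith-enabled and startup-fencing cluster properties,
since your logs below show fencing being scheduled right around the quorum
transitions.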
> This happened several times, resulting in downtime.
> In the logs below vserver1 should stonith vserver3 (since it's down), but it
> shouldn't stonith vserver2, since it has quorum with it.
>
> We are in the dark, please help if you can.

Is this new, i.e. was this cluster working fine for a while earlier? I
suppose that you did some testing before going live. Below are some notes,
but in general I cannot offer much help.

> Best Regards,
> Spiros Ioannou
>
> (from hb_report's events.txt)
>
> May 10 20:08:58 vserver1 corosync[3965]: [MAIN ] Corosync Cluster Engine ('1.2.1'): started and ready to provide service.
> May 10 20:42:25 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update: memb: vserver1 218169610
> May 10 20:42:25 vserver1 crmd: [3999]: info: crm_update_quorum: Updating quorum status to false (call=36)
> May 10 20:48:44 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update: memb: vserver1 218169610
> May 10 20:48:44 vserver1 crmd: [3999]: notice: ais_dispatch: Membership 4804: quorum acquired
> May 10 20:48:44 *vserver1* crmd: [3999]: info: crm_update_quorum: *Updating quorum status to true* (call=87)
> May 10 20:48:44 vserver2 corosync[4202]: [MAIN ] Corosync Cluster Engine ('1.2.1'): started and ready to provide service.
> May 10 20:48:45 vserver2 crmd: [4233]: notice: ais_dispatch: Membership 4804: quorum acquired
> May 10 20:48:46 *vserver1* crmd: [3999]: info: crm_update_quorum: Updating *quorum status to true* (call=102)
> May 10 20:48:53 vserver1 crmd: [3999]: info: te_fence_node: Executing poweroff fencing operation (50) on vserver3 (timeout=180000)
> May 10 20:48:53 vserver1 stonithd: [3994]: info: client tengine [pid: 3999] requests a STONITH operation POWEROFF on node vserver3
> May 10 20:49:00 vserver1 stonithd: [3994]: info: Succeeded to STONITH the node vserver3: optype=POWEROFF. whodoit: vserver1
> May 10 20:50:59 vserver1 crmd: [3999]: info: te_fence_node: Executing poweroff fencing operation (36) on vserver2 (timeout=180000)
> May 10 20:50:59 vserver1 stonithd: [3994]: info: client tengine [pid: 3999] requests a STONITH operation POWEROFF on node vserver2
> May 10 20:51:06 vserver1 stonithd: [3994]: info: Succeeded to STONITH the node vserver2: optype=POWEROFF. whodoit: vserver1
> May 10 20:51:06 vserver1 crmd: [3999]: info: te_fence_node: Executing poweroff fencing operation (19) on vserver1 (timeout=180000)
> May 10 20:51:06 vserver1 stonithd: [3994]: info: client tengine [pid: 3999] requests a STONITH operation POWEROFF on node vserver1
> May 10 20:51:10 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update: memb: vserver1 218169610
> May 10 20:51:10 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update: lost: vserver2 184615178
> May 10 20:51:10 vserver1 crmd: [3999]: notice: ais_dispatch: Membership 4808: quorum lost
> May 10 20:51:10 vserver1 crmd: [3999]: info: crm_update_quorum: Updating quorum status to false (call=149)
> May 10 20:54:06 vserver1 stonithd: [3994]: ERROR: Failed to STONITH the node vserver1: optype=POWEROFF, op_result=TIMEOUT
>
> AND (analysis.txt)
>
> May 10 20:42:00 vserver1 crmd: [3999]: ERROR: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped!
> May 10 20:42:56 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction: vm-websrv:monitor process (PID 4191) will not die!
> May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction: vm-alm:monitor process (PID 4267) will not die!
> May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction: vm-be:monitor process (PID 4268) will not die!
> May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction: vm-cam:monitor process (PID 4269) will not die!

This seems to point to processes in the D state (waiting for device). Is
your storage hanging or temporarily gone?
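If it happens again, one quick way to confirm that is to look for processes
stuck in uninterruptible sleep, for instance (just a generic check, nothing
pacemaker-specific):

    # show processes in D state and what kernel function they are waiting in
    ps axo pid,stat,wchan:32,args | awk '$2 ~ /D/'

If the monitor processes of the vm-* resources show up there waiting on I/O,
that would support the storage theory.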
> May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation vm-websrv_monitor_0 (20) Timed Out (timeout=20000ms)
> May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation vm-alm_monitor_0 (10) Timed Out (timeout=20000ms)
> May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation vm-be_monitor_0 (11) Timed Out (timeout=20000ms)
> May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation vm-cam_monitor_0 (12) Timed Out (timeout=20000ms)
> May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-cam_start_0 (44) Timed Out (timeout=20000ms)
> May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-devops_start_0 (46) Timed Out (timeout=20000ms)
> May 10 20:49:26 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-router_start_0 (48) Timed Out (timeout=20000ms)
> May 10 20:49:27 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-zabbix_start_0 (49) Timed Out (timeout=20000ms)
> May 10 20:49:40 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-ipsecgw_start_0 (51) Timed Out (timeout=20000ms)
> May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-cam_stop_0 (56) Timed Out (timeout=20000ms)
> May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-devops_stop_0 (58) Timed Out (timeout=20000ms)
> May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-router_stop_0 (59) Timed Out (timeout=20000ms)
> May 10 20:50:22 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-zabbix_stop_0 (61) Timed Out (timeout=20000ms)
> May 10 20:50:23 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-ipsecgw_stop_0 (63) Timed Out (timeout=20000ms)
> May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order: Invalid inputs (nil).(nil) 0x127c090.(nil)
> May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order: Invalid inputs (nil).(nil) 0x127c880.(nil)
> May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order: Invalid inputs (nil).(nil) 0x127cd30.(nil)
> May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order: Invalid inputs (nil).(nil) 0x127d1e0.(nil)

Hmm, that's rather cryptic. I don't know what it means. Perhaps search the
ML archives?

Thanks,

Dejan

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org