Hello Diane,

the problem is that pacemaker is not allowed to take over resources until stonith succeeds, as it simply does not know the state of the other server. Let's assume the other node were still up and running, had mounted a shared storage device and were writing to it, but no longer responded on the network. If pacemaker now mounted that device a second time, you would get data corruption. To protect you against that, it requires that stonith succeeds, or that you resolve the situation manually.
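(Not from Bernd's mail, just to illustrate the manual path: once you have confirmed out of band - via the console, the management card, or physically - that the peer really is powered off, you can tell the cluster so. The exact command depends on the crm shell / pacemaker version; the lines below are a hedged sketch, with qpr3 taken from the thread.)

  # Run only after verifying out of band that qpr3 is really down:
  crm node clearstate qpr3        # older crm shell: reset the node's recorded state
  stonith_admin --confirm qpr3    # newer pacemaker: acknowledge qpr3 as safely fenced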
The only automatic solution would be a more reliable stonith device, e.g. IPMI
with an extra power supply for the IPMI card, or a PDU.

Cheers,
Bernd

On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> Thanks for the idea. Is there any way to automatically recover resources
> without manual intervention?
>
> Diane
>
> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY
> MATERIAL and is thus for use only by the intended recipient. If you
> received this in error, please contact the sender and delete the e-mail
> and its attachments from all computers.
>
> -----Original Message-----
> From: Bernd Schubert [mailto:bs_li...@aakef.fastmail.fm]
> Sent: Tuesday, June 15, 2010 1:39 PM
> To: pacemaker@oss.clusterlabs.org
> Cc: Schaefer, Diane E
> Subject: Re: [Pacemaker] abrupt power failure problem
>
> On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> > Hi,
> >   We are having trouble with our two node cluster after one node
> > experiences an abrupt power failure. The resources do not seem to start
> > on the remaining node (ie DRBD resources do not promote to master). In
> > the log we notice:
> >
> > Jan 8 02:12:27 qpr4 stonithd: [6622]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3' returned 256
> > Jan 8 02:12:27 qpr4 stonithd: [6622]: CRIT: external_reset_req: 'ipmi reset' for host qpr3 failed with rc 256
> > Jan 8 02:12:27 qpr4 stonithd: [5854]: info: failed to STONITH node qpr3 with local device stonith0 (exitcode 5), gonna try the next local device
> > Jan 8 02:12:27 qpr4 stonithd: [5854]: info: we can't manage qpr3, broadcast request to other nodes
> > Jan 8 02:13:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT
> >
> > Jan 8 02:13:27 qpr4 stonithd: [6763]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3' returned 256
> > Jan 8 02:13:27 qpr4 stonithd: [6763]: CRIT: external_reset_req: 'ipmi reset' for host qpr3 failed with rc 256
> > Jan 8 02:13:27 qpr4 stonithd: [5854]: info: failed to STONITH node qpr3 with local device stonith0 (exitcode 5), gonna try the next local device
> > Jan 8 02:13:27 qpr4 stonithd: [5854]: info: we can't manage qpr3, broadcast request to other nodes
> > Jan 8 02:14:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT
>
> Without looking at your hb_report, this already looks pretty clear - this
> node tries to reset the other node using IPMI and that fails, of course,
> as the node to be reset is powered off.
> When we had that problem in the past, we simply temporarily removed the
> failed node from the pacemaker configuration: crm node remove <node-name>
>
> Cheers,
> Bernd
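(Again not from the original thread, just a sketch of the dual-device idea Bernd suggests above: with a PDU as a second fencing path, fencing can still succeed when the node's own power - and with it the onboard IPMI - is gone. Plugin and parameter names below are assumptions and vary between cluster-glue versions; check them with 'stonith -L' and 'stonith -t <plugin> -n', and treat all addresses and credentials as placeholders.)

  # List the available stonith plugins and the parameters a given plugin expects:
  stonith -L
  stonith -t external/rackpdu -n

  # Hypothetical two-device setup for fencing qpr3 (all values are placeholders):
  crm configure primitive st-ipmi-qpr3 stonith:external/ipmi \
          params hostname=qpr3 ipaddr=192.168.1.13 userid=admin passwd=secret interface=lan
  crm configure primitive st-pdu-qpr3 stonith:external/rackpdu \
          params hostlist=qpr3 pduip=192.168.1.50 community=private
  # Keep each fencing device away from the node it is meant to fence:
  crm configure location l-st-ipmi-qpr3 st-ipmi-qpr3 -inf: qpr3
  crm configure location l-st-pdu-qpr3 st-pdu-qpr3 -inf: qpr3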