Thanks for the idea. Is there any way to automatically recover resources 
without manual intervention?

Diane


-----Original Message-----
From: Bernd Schubert [mailto:bs_li...@aakef.fastmail.fm]
Sent: Tuesday, June 15, 2010 1:39 PM
To: pacemaker@oss.clusterlabs.org
Cc: Schaefer, Diane E
Subject: Re: [Pacemaker] abrupt power failure problem

On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> Hi,
>   We are having trouble with our two-node cluster after one node
>  experiences an abrupt power failure.  The resources do not seem to start
>  on the remaining node (i.e., the DRBD resources are not promoted to
>  master).  In the log we notice:
>
> Jan  8 02:12:27 qpr4 stonithd: [6622]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3' returned 256
> Jan  8 02:12:27 qpr4 stonithd: [6622]: CRIT: external_reset_req: 'ipmi reset' for host qpr3 failed with rc 256
> Jan  8 02:12:27 qpr4 stonithd: [5854]: info: failed to STONITH node qpr3 with local device stonith0 (exitcode 5), gonna try the next local device
> Jan  8 02:12:27 qpr4 stonithd: [5854]: info: we can't manage qpr3, broadcast request to other nodes
> Jan  8 02:13:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT
>
> Jan  8 02:13:27 qpr4 stonithd: [6763]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3' returned 256
> Jan  8 02:13:27 qpr4 stonithd: [6763]: CRIT: external_reset_req: 'ipmi reset' for host qpr3 failed with rc 256
> Jan  8 02:13:27 qpr4 stonithd: [5854]: info: failed to STONITH node qpr3 with local device stonith0 (exitcode 5), gonna try the next local device
> Jan  8 02:13:27 qpr4 stonithd: [5854]: info: we can't manage qpr3, broadcast request to other nodes
> Jan  8 02:14:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT

Without looking at your hb_report, this already looks pretty clear: this node
tries to reset the other node using IPMI, and that fails, of course, because
the node to be reset is powered off.
When we had that problem in the past, we simply removed the failed node from
the pacemaker configuration temporarily: crm node remove <node-name>
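
To check a log for this pattern before touching the configuration, a minimal
Python sketch could pull out the failed fencing attempts. The regular
expression is modeled only on the stonithd lines quoted above, the function
name failed_fencing is made up for illustration, and each log entry is assumed
to sit on a single line (not wrapped as in the quoted mail).

```python
import re

# Pattern for the final stonithd error of a failed fencing attempt, e.g.
# "... stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT"
# Assumption: the format is taken from the log excerpt in this thread only.
FAILED = re.compile(
    r"stonithd: \[\d+\]: ERROR: Failed to STONITH the node (?P<node>\S+): "
    r"optype=(?P<op>\w+), op_result=(?P<result>\w+)"
)


def failed_fencing(log_text):
    """Return (node, optype, op_result) for every failed STONITH attempt."""
    return [(m["node"], m["op"], m["result"]) for m in FAILED.finditer(log_text)]


# One unwrapped entry from the quoted log.
sample = (
    "Jan  8 02:13:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node "
    "qpr3: optype=RESET, op_result=TIMEOUT\n"
)
print(failed_fencing(sample))
```

If the same node shows up repeatedly with op_result=TIMEOUT, as qpr3 does
above, fencing never succeeds, and that matches the resources not being
started on the surviving node.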


Cheers,
Bernd

