Hello,

First of all, thank you very much for your quick reply.

Your advice has made me think about the power problem and its relation to STONITH. In my case, I use two machines with an iLO-like system (as on HP servers) and two power supplies.

Really, it would be very unusual for both power supplies to fail at the same time. The other case would be the motherboard getting seriously damaged.

In any case, I understand that I'll need a third element (independent of both machines) to ensure that STONITH works properly. Maybe something like a UPS or a switched power distribution unit (PDU).
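
For example, if the crm shell in use supports fencing levels, a second PDU-based device could be layered under the existing IPMI ones. This is only a sketch; the agent name (external/rackpdu) and its parameters are placeholders for whatever matches the actual hardware:

    # Hypothetical PDU-based fencing device; replace the agent and its
    # parameters with the ones that match the real power switch.
    primitive pdu-STONITH stonith:external/rackpdu \
            params pduip=10.0.0.50 hostlist="node1 node2" community=private
    # Try the IPMI device for each node first, fall back to the PDU
    fencing_topology \
            node1: node1-STONITH pdu-STONITH \
            node2: node2-STONITH pdu-STONITH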

I'll try to investigate this a little more. Again, thank you very much for your help.

Cheers,

On 08/06/12 13:45, Florian Haas wrote:
On Fri, Jun 8, 2012 at 1:01 PM, Juan M. Sierra <jmsie...@cica.es> wrote:
Problem with state: UNCLEAN (OFFLINE)

Hello,

I'm trying to set up an ldirectord service with Pacemaker.

But I found a problem with the UNCLEAN (offline) state. The initial state
of my cluster was this:

Online: [ node2 node1 ]

node1-STONITH    (stonith:external/ipmi):        Started node2
node2-STONITH    (stonith:external/ipmi):        Started node1
  Clone Set: Connected
      Started: [ node2 node1 ]
  Clone Set: ldirector-activo-activo
      Started: [ node2 node1 ]
ftp-vip (ocf::heartbeat:IPaddr):        Started node1
web-vip (ocf::heartbeat:IPaddr):        Started node2

Migration summary:
* Node node1:  pingd=2000
* Node node2:  pingd=2000
    node2-STONITH: migration-threshold=1000000 fail-count=1000000

Then I removed the power connection of node1, and the state was the
following:

Node node1 (8b2aede9-61bb-4a5a-aef6-25fbdefdddfd): UNCLEAN (offline)
Online: [ node2 ]

node1-STONITH    (stonith:external/ipmi):        Started node2 FAILED
  Clone Set: Connected
      Started: [ node2 ]
      Stopped: [ ping:1 ]
  Clone Set: ldirector-activo-activo
      Started: [ node2 ]
      Stopped: [ ldirectord:1 ]
web-vip (ocf::heartbeat:IPaddr):        Started node2

Migration summary:
* Node node2:  pingd=2000
    node2-STONITH: migration-threshold=1000000 fail-count=1000000
    node1-STONITH: migration-threshold=1000000 fail-count=1000000

Failed actions:
     node2-STONITH_start_0 (node=node2, call=22, rc=2, status=complete):
invalid parameter
     node1-STONITH_monitor_60000 (node=node2, call=11, rc=14,
status=complete): status: unknown
     node1-STONITH_start_0 (node=node2, call=34, rc=1, status=complete):
unknown error

I was hoping that node2 would take over the ftp-vip resource, but it
didn't happen that way. node1 stayed in the UNCLEAN state and node2 did not
take over its resources. When I reconnected node1's power and it
recovered, node2 then took over the ftp-vip resource.

I've seen some similar conversations here. Could you please give me some
idea about this subject, or point me to a thread where it is discussed?
Well, your healthy node failed to fence the offending node. So fix
your STONITH device configuration, and as soon as it is able to
fence, your failover should work fine.
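
For reference, a minimal external/ipmi setup in the crm shell usually looks something like the sketch below. All addresses and credentials here are placeholders, and the exact parameter set depends on your cluster-glue version, so check the parameters your external/ipmi plugin actually accepts:

    # Sketch only: IP addresses, user and password are placeholders.
    primitive node1-STONITH stonith:external/ipmi \
            params hostname=node1 ipaddr=192.168.1.101 userid=admin \
                   passwd=secret interface=lanplus \
            op monitor interval=60s
    primitive node2-STONITH stonith:external/ipmi \
            params hostname=node2 ipaddr=192.168.1.102 userid=admin \
                   passwd=secret interface=lanplus \
            op monitor interval=60s
    # A node should never run the device that is meant to fence itself
    location l-node1-STONITH node1-STONITH -inf: node1
    location l-node2-STONITH node2-STONITH -inf: node2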

Of course, if your IPMI BMC fails immediately after you remove power
from the machine (i.e. it has no backup battery that would let it at
least report the power status), then you might have to fix your issue by
switching to a different STONITH device altogether.
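
Either way, it is worth triggering a fence by hand once the device is configured, to see whether the node really gets reset. Assuming a reasonably recent Pacemaker that ships stonith_admin, something like this should do:

    # Ask the cluster's fencing subsystem to reboot node1
    stonith_admin --reboot node1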

Cheers,
Florian


--
Juan Manuel Sierra Prieto
Administración de Sistemas
Centro Informatico Cientifico de Andalucia (CICA)
Avda. Reina Mercedes s/n - 41012 - Sevilla (Spain)
Tfno.: +34 955 056 600 / FAX: +34 955 056 650
Consejería de Economía, Innovación y Ciencia
Junta de Andalucía


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
