[Pacemaker] Resource does not auto recover from failed state

tetsuo shima Tue, 27 Aug 2013 01:38:57 -0700

Hi list!

I'm having an issue with Pacemaker (running on corosync). Here is the scenario:


# crm_mon -1
============
Last updated: Tue Aug 27 09:50:13 2013
Last change: Mon Aug 26 16:06:01 2013 via cibadmin on node2
Stack: openais
Current DC: node1 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
3 Resources configured.
============

Online: [ node2 node1 ]

 ip    (ocf::heartbeat:IPaddr2):    Started node1
 Clone Set: mysql-mm [mysql] (unmanaged)
     mysql:0    (ocf::heartbeat:mysql):    Started node1 (unmanaged)
     mysql:1    (ocf::heartbeat:mysql):    Started node2 (unmanaged)

# /etc/init.d/mysql stop
[ ok ] Stopping MySQL database server: mysqld.

# crm_mon -1
============
Last updated: Tue Aug 27 09:50:30 2013
Last change: Mon Aug 26 16:06:01 2013 via cibadmin on node2
Stack: openais
Current DC: node1 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
3 Resources configured.
============

Online: [ node2 node1 ]

 ip    (ocf::heartbeat:IPaddr2):    Started node1
 Clone Set: mysql-mm [mysql] (unmanaged)
     mysql:0    (ocf::heartbeat:mysql):    Started node1 (unmanaged)
     mysql:1    (ocf::heartbeat:mysql):    Started node2 (unmanaged) FAILED

Failed actions:
    mysql:0_monitor_15000 (node=node2, call=27, rc=7, status=complete): not running

# /etc/init.d/mysql start
[ ok ] Starting MySQL database server: mysqld ..
[info] Checking for tables which need an upgrade, are corrupt or were not closed cleanly..

# sleep 60 && crm_mon -1
============
Last updated: Tue Aug 27 09:51:54 2013
Last change: Mon Aug 26 16:06:01 2013 via cibadmin on node2
Stack: openais
Current DC: node1 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
3 Resources configured.
============

Online: [ node2 node1 ]

 ip    (ocf::heartbeat:IPaddr2):    Started node1
 Clone Set: mysql-mm [mysql] (unmanaged)
     mysql:0    (ocf::heartbeat:mysql):    Started node1 (unmanaged)
     mysql:1    (ocf::heartbeat:mysql):    Started node2 (unmanaged) FAILED

Failed actions:
    mysql:0_monitor_15000 (node=node2, call=27, rc=7, status=complete): not running

As you can see, every time I stop MySQL (which is unmanaged), the resource
is marked as failed:

crmd: [1828]: info: process_lrm_event: LRM operation mysql:0_monitor_15000 (call=4, rc=7, cib-update=10, confirmed=false) not running

When I start MySQL again, the monitor operation reports success:

crmd: [1828]: info: process_lrm_event: LRM operation mysql:0_monitor_15000 (call=4, rc=0, cib-update=11, confirmed=false) ok

However, the resource stays in the failed state and does not recover until I
clean it up manually.
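
For reference, the manual cleanup mentioned above looks roughly like the
following. This is only a sketch, assuming the stock crm_resource tool that
ships with Pacemaker; the RSC variable and the "command -v" guard are added
here for illustration and are not part of my actual setup.

```shell
#!/bin/sh
# Sketch: clear the failed state of the clone by hand.
# RSC is the clone name taken from the configuration in this post.
RSC="mysql-mm"

if command -v crm_resource >/dev/null 2>&1; then
    # Clear the recorded failed operation so crm_mon stops showing FAILED.
    crm_resource --cleanup --resource "$RSC"
    # Re-check the operation history and fail counts afterwards.
    crm_mon --one-shot --operations
else
    # Illustration-only fallback when the cluster tools are not installed.
    echo "crm_resource not found; would run: crm_resource --cleanup --resource $RSC"
fi
```

After this, the FAILED flag goes away, but it comes back the next time MySQL
is stopped and started outside of the cluster.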

# crm_mon --one-shot --operations
============
Last updated: Tue Aug 27 10:17:30 2013
Last change: Mon Aug 26 16:06:01 2013 via cibadmin on node2
Stack: openais
Current DC: node1 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
3 Resources configured.
============

Online: [ node2 node1 ]

 ip    (ocf::heartbeat:IPaddr2):    Started node1
 Clone Set: mysql-mm [mysql] (unmanaged)
     mysql:0    (ocf::heartbeat:mysql):    Started node1 (unmanaged)
     mysql:1    (ocf::heartbeat:mysql):    Started node2 (unmanaged) FAILED

Operations:
* Node node1:
   ip: migration-threshold=1
    + (57) probe: rc=0 (ok)
   mysql:0: migration-threshold=1 fail-count=1
    + (58) probe: rc=0 (ok)
    + (59) monitor: interval=15000ms rc=0 (ok)
* Node node2:
   mysql:0: migration-threshold=1 fail-count=3
    + (27) monitor: interval=15000ms rc=7 (not running)
    + (27) monitor: interval=15000ms rc=0 (ok)

Failed actions:
    mysql:0_monitor_15000 (node=node2, call=27, rc=7, status=complete): not running

---

Here are some details about my configuration:

# cat /etc/debian_version
7.1

# dpkg -l | grep corosync
ii  corosync                         1.4.2-3    amd64        Standards-based cluster framework

# dpkg -l | grep pacem
ii  pacemaker                        1.1.7-1    amd64        HA cluster resource manager

# crm configure show
node node2 \
    attributes standby="off"
node node1
primitive ip ocf:heartbeat:IPaddr2 \
    params ip="192.168.0.20" cidr_netmask="255.255.0.0" nic="eth2.2755" iflabel="mysql" \
    meta is-managed="true" target-role="Started" \
    meta resource-stickiness="100"
primitive mysql ocf:heartbeat:mysql \
    op monitor interval="15" timeout="30"
clone mysql-mm mysql \
    meta is-managed="false"
location cli-prefer-ip ip 50: node1
colocation ip-on-mysql-mm 200: ip mysql-mm
property $id="cib-bootstrap-options" \
    dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    stonith-enabled="false" \
    no-quorum-policy="ignore" \
    last-lrm-refresh="1377513557" \
    start-failure-is-fatal="false"
rsc_defaults $id="rsc-options" \
    resource-stickiness="1" \
    migration-threshold="1"

---

Does anyone know what is wrong with my configuration?

Thanks for the help,

Best regards.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
