[Pacemaker] Problem: monitor timeout causes cluster resource unmanaged and stopped on both nodes.

Oscar Remírez de Ganuza Satrústegui Wed, 16 Dec 2009 08:56:22 -0800

Hello everybody,

We have a shared disk cluster with pacemaker (1.0.6) and corosync (1.1.2). Two nodes, active (herculespre) - passive (jasonpre). MySQL service as a resource, inside a group with attribute migration-threshold="5" (so that if a resource fail-count is 5 or more, it will be moved to the passive node)

Yesterday we launched some mysql benchmarking and found out that a resource monitor timeout brought us to the resource unmanaged and stopped on both nodes!!


> crm_mon
...
Resource Group: mysql-horde-group
    mysql-horde-nfs    (ocf::heartbeat:Filesystem):    Started herculespre

mysql-horde-service (lsb:mysql-horde): Started herculespre (unmanaged) FAILED <------ !!!!

I can not understand why this situation came out. Maybe we have not properly configured the cluster? Or could it be another problem?


Analyzing the problem step by step:

1. The resource monitor timed out, and the fail-count got updated +1:

Dec 15 20:12:54 herculespre lrmd: [8559]: WARN: mysql-horde-service:monitor process (PID 12229) timed out (try 1). Killing with signal SIGTERM (15). Dec 15 20:12:54 herculespre lrmd: [8559]: WARN: operation monitor[37] on lsb::mysql-horde::mysql-horde-service for client 8562, its parameters: CRM_meta_interval=[10000] CRM_meta_timeout=[5000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [12229] timed out Dec 15 20:12:54 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM operation mysql-horde-service_monitor_10000 (37) Timed Out (timeout=5000ms) Dec 15 20:12:54 herculespre attrd: [8560]: info: attrd_local_callback: Expanded fail-count-mysql-horde-service=value++ to 1

That's ok. It could be that our mysql service was busy performing all the test operations, and took more that 5s to answer.


2. The CRM decided to stop the service.

Dec 15 20:12:55 herculespre crmd: [8562]: info: do_lrm_rsc_op: Performing key=4:1379:0:ae99a943-f4b7-4979-b0c9-09c7f9dd0f9f op=mysql-horde-service_stop_0 ) Dec 15 20:12:55 herculespre lrmd: [8559]: info: rsc:mysql-horde-service:38: stop

3. The MySQL service received the order and shutted down properly. From mysql.log: 091215 20:13:14 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Normal shutdown

...

091215 20:13:17 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Shutdown complete

4. Here comes the problem: the cluster did not received the confirmation that the service was properly shutted down: Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: mysql-horde-service:stop process (PID 12270) timed out (try 1). Killing with signal SIGTERM (15). Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: operation stop[38] on lsb::mysql-horde::mysql-horde-service for client 8562, its parameters: CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] : pid [12270] timed out Dec 15 20:13:17 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM operation mysql-horde-service_stop_0 (38) Timed Out (timeout=20000ms)

What is happening here?? As it appears in the log, the timeout is suposed to be 20s (20000ms), and the service jsut took 3s to shutdown.

Is it a problem with lrmd?

5. As a stop-action has failed, the fail-count got updated to INFINITY

Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_perform_update: Sent update 43: fail-count-mysql-horde-service=INFINITY

The final output is that the resource is unmanaged and stopped on both nodes!!!

We have not configured any stonith resource, as the servers are on Vmware Virtual Center and we have not found a working solution. We could not stopped the crmd process properly, and we had to kill it. When we rebooted the node, this situation got solved, so we reckon that with a stonith resource this could have been solved automatically. But we still do not understand what happened here and why the resource got "unmanaged".

Could this be a bug?

I attached the full logs and the configuration, as I did not want to spam the list.

We will highly appreciate any help understanding what could have happened wrong.


Thank you very much for your help!!

Regards,

---
Oscar Remírez de Ganuza
Servicios Informáticos
Universidad de Navarra
Ed. de Derecho, Campus Universitario
31080 Pamplona (Navarra), Spain
tfno: +34 948 425600 Ext. 3130
http://www.unav.es/SI

crm(live)# configure show
node herculespre \
        attributes standby="off"
node jasonpre \
        attributes standby="off"
primitive mysql-horde-nfs ocf:heartbeat:Filesystem \
        params directory="/usr/local/etc2/mysql-horde/var/" fstype="nfs" 
device="X.X.X.X:/nfs_partition" 
options="auto,rw,timeo=600,soft,intr,wsize=32768,rsize=32768,tcp" \
        meta target-role="Started" \
        op monitor interval="10s" timeout="5s"
primitive mysql-horde-service lsb:mysql-horde \
        meta target-role="Started" \
        op monitor interval="10s" timeout="5s"
primitive pingd ocf:pacemaker:pingd \
        params host_list="X.X.X.X" multiplier="100" \
        op monitor interval="15s" timeout="5s" \
        meta target-role="Started"
group mysql-horde-group mysql-horde-nfs mysql-horde-service \
        meta migration-threshold="5"
clone pingdclone pingd \
        meta globally-unique="false"
location mysql-horde-location mysql-horde-group 100: herculespre
location mysql-horde-location-online mysql-horde-group \
        rule $id="mysql-horde-location-online-rule" -inf: not_defined pingd or 
pingd lte 0
property $id="cib-bootstrap-options" \
        dc-version="1.0.6-f709c638237cdff7556cb6ab615f32826c0f8c06" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        start-failure-is-fatal="false" \
        stonith-action="reboot" \
        last-lrm-refresh="1258714605"

============
Last updated: Tue Dec 15 20:13:17 2009
Stack: openais
Current DC: jasonpre - partition with quorum
Version: 1.0.6-f709c638237cdff7556cb6ab615f32826c0f8c06
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ herculespre jasonpre ]

 Clone Set: pingdclone
     Started: [ herculespre jasonpre ]
 Resource Group: mysql-horde-group
     mysql-horde-nfs    (ocf::heartbeat:Filesystem):    Started herculespre
     mysql-horde-service        (lsb:mysql-horde):      Started herculespre 
(unmanaged) FAILED
...

Failed actions:
    mysql-horde-service_monitor_10000 (node=herculespre, call=37, rc=-2, 
status=Timed Out): unknown exec error
    mysql-horde-service_stop_0 (node=herculespre, call=38, rc=-2, status=Timed 
Out): unknown exec error

Dec 15 20:12:54 herculespre lrmd: [8559]: WARN: mysql-horde-service:monitor 
process (PID 12229) timed out (try 1). Killing with signal SIGTERM (15).
Dec 15 20:12:54 herculespre lrmd: [8559]: WARN: operation monitor[37] on 
lsb::mysql-horde::mysql-horde-service for client 8562, its parameters: 
CRM_meta_interval=[10000] CRM_meta_timeout=[5000] crm_feature_set=[3.0.1] 
CRM_meta_name=[monitor] : pid [12229] timed out
Dec 15 20:12:54 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM 
operation mysql-horde-service_monitor_10000 (37) Timed Out (timeout=5000ms)
Dec 15 20:12:54 herculespre attrd: [8560]: info: attrd_ais_dispatch: Update 
relayed from jasonpre
Dec 15 20:12:54 herculespre attrd: [8560]: info: attrd_local_callback: Expanded 
fail-count-mysql-horde-service=value++ to 1
Dec 15 20:12:54 herculespre attrd: [8560]: info: attrd_trigger_update: Sending 
flush op to all hosts for: fail-count-mysql-horde-service (1)
Dec 15 20:12:55 herculespre crmd: [8562]: info: do_lrm_rsc_op: Performing 
key=4:1379:0:ae99a943-f4b7-4979-b0c9-09c7f9dd0f9f op=mysql-horde-service_stop_0 
)
Dec 15 20:12:55 herculespre lrmd: [8559]: info: rsc:mysql-horde-service:38: stop
Dec 15 20:12:55 herculespre lrmd: [12270]: WARN: For LSB init script, no 
additional parameters are needed.
Dec 15 20:12:55 herculespre crmd: [8562]: info: process_lrm_event: LRM 
operation mysql-horde-service_monitor_10000 (call=37, status=1, cib-update=0, 
confirmed=true) Cancelled
Dec 15 20:12:55 herculespre attrd: [8560]: info: attrd_perform_update: Sent 
update 38: fail-count-mysql-horde-service=1
Dec 15 20:12:55 herculespre attrd: [8560]: info: attrd_ais_dispatch: Update 
relayed from jasonpre
Dec 15 20:13:13 herculespre lrmd: [8559]: info: RA output: 
(mysql-horde-service:stop:stdout) Shutting down MySQL.
Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-mysql-horde-service (1260904374)
Dec 15 20:13:17 herculespre lrmd: [8559]: info: RA output: 
(mysql-horde-service:stop:stdout) ..
Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_perform_update: Sent 
update 41: last-failure-mysql-horde-service=1260904374
Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: mysql-horde-service:stop 
process (PID 12270) timed out (try 1). Killing with signal SIGTERM (15).
Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: operation stop[38] on 
lsb::mysql-horde::mysql-horde-service for client 8562, its parameters: 
CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] : pid [12270] timed out
Dec 15 20:13:17 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM 
operation mysql-horde-service_stop_0 (38) Timed Out (timeout=20000ms)
Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_ais_dispatch: Update 
relayed from jasonpre
Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_trigger_update: Sending 
flush op to all hosts for: fail-count-mysql-horde-service (INFINITY)
Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_perform_update: Sent 
update 43: fail-count-mysql-horde-service=INFINITY
Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_ais_dispatch: Update 
relayed from jasonpre
Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-mysql-horde-service (1260904397)
Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_perform_update: Sent 
update 45: last-failure-mysql-horde-service=1260904397
Dec 15 20:15:03 herculespre cib: [8558]: info: cib_stats: Processed 18 
operations (0.00us average, 0% utilization) in the last 10min

091215 20:13:14 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Normal 
shutdown
091215 20:13:14 [Note] Event Scheduler: Purging the queue. 0 events
091215 20:13:16 [Warning] /usr/local/etc2/mysql-horde/libexec/mysqld: Forcing 
close of thread 123252 user: 'root'
091215 20:13:17 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Shutdown 
complete
091215 20:13:17 mysqld_safe mysqld from pid file 
/usr/local/etc2/mysql-horde/var/herculespre.pid ended

smime.p7s
Description: S/MIME Cryptographic Signature

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

[Pacemaker] Problem: monitor timeout causes cluster resource unmanaged and stopped on both nodes.

Reply via email to