Hello everybody,We have a shared disk cluster with pacemaker (1.0.6) and corosync (1.1.2). Two nodes, active (herculespre) - passive (jasonpre). MySQL service as a resource, inside a group with attribute migration-threshold="5" (so that if a resource fail-count is 5 or more, it will be moved to the passive node)
Yesterday we launched some mysql benchmarking and found out that a resource monitor timeout brought us to the resource unmanaged and stopped on both nodes!!
> crm_mon ... Resource Group: mysql-horde-group mysql-horde-nfs (ocf::heartbeat:Filesystem): Started herculespremysql-horde-service (lsb:mysql-horde): Started herculespre (unmanaged) FAILED <------ !!!!
I can not understand why this situation came out. Maybe we have not properly configured the cluster? Or could it be another problem?
Analyzing the problem step by step: 1. The resource monitor timed out, and the fail-count got updated +1:Dec 15 20:12:54 herculespre lrmd: [8559]: WARN: mysql-horde-service:monitor process (PID 12229) timed out (try 1). Killing with signal SIGTERM (15). Dec 15 20:12:54 herculespre lrmd: [8559]: WARN: operation monitor[37] on lsb::mysql-horde::mysql-horde-service for client 8562, its parameters: CRM_meta_interval=[10000] CRM_meta_timeout=[5000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [12229] timed out Dec 15 20:12:54 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM operation mysql-horde-service_monitor_10000 (37) Timed Out (timeout=5000ms) Dec 15 20:12:54 herculespre attrd: [8560]: info: attrd_local_callback: Expanded fail-count-mysql-horde-service=value++ to 1
That's ok. It could be that our mysql service was busy performing all the test operations, and took more that 5s to answer.
2. The CRM decided to stop the service.Dec 15 20:12:55 herculespre crmd: [8562]: info: do_lrm_rsc_op: Performing key=4:1379:0:ae99a943-f4b7-4979-b0c9-09c7f9dd0f9f op=mysql-horde-service_stop_0 ) Dec 15 20:12:55 herculespre lrmd: [8559]: info: rsc:mysql-horde-service:38: stop
3. The MySQL service received the order and shutted down properly. From mysql.log: 091215 20:13:14 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Normal shutdown
...091215 20:13:17 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Shutdown complete
4. Here comes the problem: the cluster did not received the confirmation that the service was properly shutted down: Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: mysql-horde-service:stop process (PID 12270) timed out (try 1). Killing with signal SIGTERM (15). Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: operation stop[38] on lsb::mysql-horde::mysql-horde-service for client 8562, its parameters: CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] : pid [12270] timed out Dec 15 20:13:17 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM operation mysql-horde-service_stop_0 (38) Timed Out (timeout=20000ms)
What is happening here?? As it appears in the log, the timeout is suposed to be 20s (20000ms), and the service jsut took 3s to shutdown.
Is it a problem with lrmd? 5. As a stop-action has failed, the fail-count got updated to INFINITYDec 15 20:13:17 herculespre attrd: [8560]: info: attrd_perform_update: Sent update 43: fail-count-mysql-horde-service=INFINITY
The final output is that the resource is unmanaged and stopped on both nodes!!!
We have not configured any stonith resource, as the servers are on Vmware Virtual Center and we have not found a working solution. We could not stopped the crmd process properly, and we had to kill it. When we rebooted the node, this situation got solved, so we reckon that with a stonith resource this could have been solved automatically. But we still do not understand what happened here and why the resource got "unmanaged".
Could this be a bug?I attached the full logs and the configuration, as I did not want to spam the list.
We will highly appreciate any help understanding what could have happened wrong.
Thank you very much for your help!! Regards, --- Oscar Remírez de Ganuza Servicios Informáticos Universidad de Navarra Ed. de Derecho, Campus Universitario 31080 Pamplona (Navarra), Spain tfno: +34 948 425600 Ext. 3130 http://www.unav.es/SI
crm(live)# configure show node herculespre \ attributes standby="off" node jasonpre \ attributes standby="off" primitive mysql-horde-nfs ocf:heartbeat:Filesystem \ params directory="/usr/local/etc2/mysql-horde/var/" fstype="nfs" device="X.X.X.X:/nfs_partition" options="auto,rw,timeo=600,soft,intr,wsize=32768,rsize=32768,tcp" \ meta target-role="Started" \ op monitor interval="10s" timeout="5s" primitive mysql-horde-service lsb:mysql-horde \ meta target-role="Started" \ op monitor interval="10s" timeout="5s" primitive pingd ocf:pacemaker:pingd \ params host_list="X.X.X.X" multiplier="100" \ op monitor interval="15s" timeout="5s" \ meta target-role="Started" group mysql-horde-group mysql-horde-nfs mysql-horde-service \ meta migration-threshold="5" clone pingdclone pingd \ meta globally-unique="false" location mysql-horde-location mysql-horde-group 100: herculespre location mysql-horde-location-online mysql-horde-group \ rule $id="mysql-horde-location-online-rule" -inf: not_defined pingd or pingd lte 0 property $id="cib-bootstrap-options" \ dc-version="1.0.6-f709c638237cdff7556cb6ab615f32826c0f8c06" \ cluster-infrastructure="openais" \ expected-quorum-votes="2" \ stonith-enabled="false" \ no-quorum-policy="ignore" \ start-failure-is-fatal="false" \ stonith-action="reboot" \ last-lrm-refresh="1258714605"
============ Last updated: Tue Dec 15 20:13:17 2009 Stack: openais Current DC: jasonpre - partition with quorum Version: 1.0.6-f709c638237cdff7556cb6ab615f32826c0f8c06 2 Nodes configured, 2 expected votes 4 Resources configured. ============ Online: [ herculespre jasonpre ] Clone Set: pingdclone Started: [ herculespre jasonpre ] Resource Group: mysql-horde-group mysql-horde-nfs (ocf::heartbeat:Filesystem): Started herculespre mysql-horde-service (lsb:mysql-horde): Started herculespre (unmanaged) FAILED ... Failed actions: mysql-horde-service_monitor_10000 (node=herculespre, call=37, rc=-2, status=Timed Out): unknown exec error mysql-horde-service_stop_0 (node=herculespre, call=38, rc=-2, status=Timed Out): unknown exec error
Dec 15 20:12:54 herculespre lrmd: [8559]: WARN: mysql-horde-service:monitor process (PID 12229) timed out (try 1). Killing with signal SIGTERM (15). Dec 15 20:12:54 herculespre lrmd: [8559]: WARN: operation monitor[37] on lsb::mysql-horde::mysql-horde-service for client 8562, its parameters: CRM_meta_interval=[10000] CRM_meta_timeout=[5000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [12229] timed out Dec 15 20:12:54 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM operation mysql-horde-service_monitor_10000 (37) Timed Out (timeout=5000ms) Dec 15 20:12:54 herculespre attrd: [8560]: info: attrd_ais_dispatch: Update relayed from jasonpre Dec 15 20:12:54 herculespre attrd: [8560]: info: attrd_local_callback: Expanded fail-count-mysql-horde-service=value++ to 1 Dec 15 20:12:54 herculespre attrd: [8560]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-mysql-horde-service (1) Dec 15 20:12:55 herculespre crmd: [8562]: info: do_lrm_rsc_op: Performing key=4:1379:0:ae99a943-f4b7-4979-b0c9-09c7f9dd0f9f op=mysql-horde-service_stop_0 ) Dec 15 20:12:55 herculespre lrmd: [8559]: info: rsc:mysql-horde-service:38: stop Dec 15 20:12:55 herculespre lrmd: [12270]: WARN: For LSB init script, no additional parameters are needed. Dec 15 20:12:55 herculespre crmd: [8562]: info: process_lrm_event: LRM operation mysql-horde-service_monitor_10000 (call=37, status=1, cib-update=0, confirmed=true) Cancelled Dec 15 20:12:55 herculespre attrd: [8560]: info: attrd_perform_update: Sent update 38: fail-count-mysql-horde-service=1 Dec 15 20:12:55 herculespre attrd: [8560]: info: attrd_ais_dispatch: Update relayed from jasonpre Dec 15 20:13:13 herculespre lrmd: [8559]: info: RA output: (mysql-horde-service:stop:stdout) Shutting down MySQL. Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mysql-horde-service (1260904374) Dec 15 20:13:17 herculespre lrmd: [8559]: info: RA output: (mysql-horde-service:stop:stdout) .. Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_perform_update: Sent update 41: last-failure-mysql-horde-service=1260904374 Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: mysql-horde-service:stop process (PID 12270) timed out (try 1). Killing with signal SIGTERM (15). Dec 15 20:13:17 herculespre lrmd: [8559]: WARN: operation stop[38] on lsb::mysql-horde::mysql-horde-service for client 8562, its parameters: CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] : pid [12270] timed out Dec 15 20:13:17 herculespre crmd: [8562]: ERROR: process_lrm_event: LRM operation mysql-horde-service_stop_0 (38) Timed Out (timeout=20000ms) Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_ais_dispatch: Update relayed from jasonpre Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-mysql-horde-service (INFINITY) Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_perform_update: Sent update 43: fail-count-mysql-horde-service=INFINITY Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_ais_dispatch: Update relayed from jasonpre Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mysql-horde-service (1260904397) Dec 15 20:13:17 herculespre attrd: [8560]: info: attrd_perform_update: Sent update 45: last-failure-mysql-horde-service=1260904397 Dec 15 20:15:03 herculespre cib: [8558]: info: cib_stats: Processed 18 operations (0.00us average, 0% utilization) in the last 10min
091215 20:13:14 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Normal shutdown 091215 20:13:14 [Note] Event Scheduler: Purging the queue. 0 events 091215 20:13:16 [Warning] /usr/local/etc2/mysql-horde/libexec/mysqld: Forcing close of thread 123252 user: 'root' 091215 20:13:17 [Note] /usr/local/etc2/mysql-horde/libexec/mysqld: Shutdown complete 091215 20:13:17 mysqld_safe mysqld from pid file /usr/local/etc2/mysql-horde/var/herculespre.pid ended
smime.p7s
Description: S/MIME Cryptographic Signature
_______________________________________________ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker