Hi again,

I have been digging into the documentation and thought I would answer my own questions, just to share them with the list; maybe someone else will find them interesting too.

Oscar Remírez de Ganuza Satrústegui wrote:
What is happening here? As it appears in the log, the timeout is supposed to
be 20s (20000 ms), and the service took just 3s to shut down.
Is it a problem with lrmd?
Looks like it.

It could be that you were unlucky here and that the database
really took around 20 seconds to shut down. If it is so, then
Oh, thanks! You are right!
The command to shut down the mysql resource was sent at 20:12:55, but the mysql service did not start shutting down until 20:13:14, finishing at 20:13:17 (22 seconds in total, which exceeds the 20 s timeout).

How is it possible to change the timeout for start or stop operations?
Have a look here:
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-operation-defaults.html#id525162
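In case it helps anyone else, here is a sketch of how those timeouts can be set with the crm shell. The resource name, agent, and all the timeout values below are only illustrative, not recommendations:

```shell
# Cluster-wide default timeout for all operations (illustrative value):
crm configure op_defaults timeout=60s

# Or per operation on a single resource, overriding the defaults;
# the resource name "db-example" and its agent are hypothetical:
crm configure primitive db-example ocf:heartbeat:mysql \
    op start timeout=120s \
    op stop timeout=120s \
    op monitor interval=30s timeout=60s
```

Per-operation values take precedence over op_defaults, so the defaults only apply where a resource does not define its own.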

please increase your timeouts. You also mentioned somewhere that
5s is set for a monitor timeout; that's way too low for any kind
of resource. There's a chapter on applications in HA environments
in a paper I recently presented (http://tinyurl.com/yg7u4bd).
We had configured very low timeouts for the monitors too. When I tried to change them today, even the crm shell warned and advised me:
crm(live)# configure edit
WARNING: mysql-horde-nfs: timeout 10s for monitor_0 is smaller than the advised 40
WARNING: mysql-horde-service: timeout 10s for monitor_0 is smaller than the advised 15
WARNING: pingd: timeout 10s for monitor_0 is smaller than the advised 20
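Based on those warnings, the monitor timeouts just need to be raised to at least the advised values. A sketch of what I will change (the intervals and the elided parts of each primitive are assumptions; only the monitor timeouts are the point):

```shell
# Open the configuration in the editor:
crm configure edit
# ...and adjust each monitor op to meet the advised minimum, e.g.:
#   primitive mysql-horde-nfs ...      op monitor interval=20s timeout=40s
#   primitive mysql-horde-service ...  op monitor interval=15s timeout=15s
#   primitive pingd ...                op monitor interval=15s timeout=20s
```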

I have read your paper and understand the importance of tuning the timeout values correctly, so as not to cause false positives and unnecessary unavailability.

Just two last questions:
Is it 'normal' for a resource to be set to "unmanaged" just because the stop operation timed out once?
As found here (http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-failure-migration.html): "Stop failures are slightly different and crucial. If a resource fails to stop and STONITH is enabled, then the cluster will fence the node in order to be able to start the resource elsewhere. If STONITH is not enabled, then the cluster has no way to continue and will not try to start the resource elsewhere, but will try to stop it again after the failure timeout."
Is it possible to configure the cluster to try more than once to stop a resource? (as is possible for the start operation with the cluster property start-failure-is-fatal="false")
I will configure the failure-timeout attribute and run some tests.
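For the record, a sketch of the settings mentioned above (values illustrative): failure-timeout is a resource meta attribute, so it can be set as a resource default, while start-failure-is-fatal is a cluster property:

```shell
# Let recorded failures expire after 60s (illustrative value), so the
# cluster will eventually retry the failed operation:
crm configure rsc_defaults failure-timeout=60s

# Make a failed start non-fatal, so starts are retried (subject to
# migration-threshold) instead of immediately moving the resource away:
crm configure property start-failure-is-fatal=false
```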

Thank you very much for your time building this software, and for helping us use it!

Regards,

---
Oscar Remírez de Ganuza
Servicios Informáticos
Universidad de Navarra
Ed. de Derecho, Campus Universitario
31080 Pamplona (Navarra), Spain
tfno: +34 948 425600 Ext. 3130
http://www.unav.es/SI


_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
