On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with "ver: 1"
2)
This one is scary.
Twice now I have run into a situation where pacemaker thinks a resource is started, but it is not.
The RA is misbehaving. Pacemaker will only consider a resource running if
the RA tells us it is (running or in a failed state).
But as you can see below, the agent returns "7".
We use a slightly modified version of the "anything" agent for our
scripts, but it is aware of OCF return codes and other such things.
I ran the monitor action of our agent from the console:
# env -i OCF_ROOT=/usr/lib/ocf \
      OCF_RESKEY_binfile=/usr/local/mpop/bin/my/dialogues_notify.pl \
      /usr/lib/ocf/resource.d/mail.ru/generic monitor
generic[14992]: DEBUG: default monitor : 7
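For context, the monitor action of such an agent boils down to something like the following. This is a minimal sketch assuming a pidfile-based liveness check; the OCF_RESKEY_pidfile parameter and the pidfile path are illustrative, not the real mail.ru/generic agent's parameters:

```shell
#!/bin/sh
# Standard OCF return codes (from the OCF resource agent API).
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

generic_monitor() {
    # Hypothetical pidfile parameter, for illustration only; the real
    # agent may track its managed process differently.
    pidfile="${OCF_RESKEY_pidfile:-/var/run/generic.pid}"
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        return $OCF_SUCCESS      # process alive: report "running"
    fi
    return $OCF_NOT_RUNNING      # no live process: report "not running" (7)
}

generic_monitor
echo "monitor rc: $?"
```

With no live process behind the pidfile, this prints rc 7, which is exactly what pacemaker should interpret as "cleanly stopped" on a recurring monitor.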
So our agent says it is not running, but pacemaker still thinks it is.
It ran like this for 2 days, until I was forced to do a cleanup; after
that, pacemaker found out within seconds that the resource is not running.
Did you configure a recurring monitor operation?
Of course. I included my primitive configuration in the original letter; here it is:
op monitor interval="30" timeout="300" on-fail="restart" \
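For reference, such a recurring monitor sits inside a primitive definition along these lines (the resource name and meta attributes here are illustrative, not the exact configuration from the original letter):

```
primitive p_notify ocf:mail.ru:generic \
    params binfile="/usr/local/mpop/bin/my/dialogues_notify.pl" \
    op monitor interval="30" timeout="300" on-fail="restart" \
    meta failure-timeout="120"
```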
I have now hit this a third time, and this time I found this in the logs:
Oct 01 02:00:12 mysender34.mail.ru pengine: [26301]: notice:
unpack_rsc_op: Ignoring expired failure tranprocessor_stop_0 (rc=-2,
magic=2:-2;121:690:0:4c16dc39-1fd3-41f2-b582-0236f6b6eccc) on
mysender34.mail.ru
The resource name is different because these logs are from the third
occurrence, but the problem is the same.
3)
This one is confusing and dangerous.
I use failure-timeout on most resources to wipe out temporary warning
messages from crm_verify -LV, which I use for monitoring the cluster.
All works well, but I found this:
1) A resource fails to start on a node and migrates to the next one.
2) It fails to start there too, and on all the other nodes.
3) It gives up and stops. crm_verify -LV reports many errors about all of
this, which is good.
4) The failure-timeout expires and... wipes out all the errors.
5) We are left with a stopped resource and all errors wiped. And we don't
know whether it was stopped by an admin's hand or because of errors.
I think failure-timeout should not apply to a stopped resource.
Any chance to avoid this?
Not sure why you think this is dangerous; the cluster is doing exactly
what you told it to.
If you want resources to stay stopped either set failure-timeout=0
(disabled) or set the target-role to Stopped.
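If it helps, both of these can be set from the crm shell; the resource name below is illustrative:

```
# keep failures until an explicit cleanup (no automatic expiry)
crm resource meta p_notify set failure-timeout 0

# or pin the resource stopped explicitly
crm resource meta p_notify set target-role Stopped
```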
No, I do want to use failure-timeout, but I do not want it to wipe out
errors when the resource has already been stopped by pacemaker because of
errors, not by an admin's hand.
--
Best regards,
Proskurin Kirill
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker