Hello Beekhof.

First of all, I don't want to waste your time, but this problem is really important to me, I can't solve it by myself, and it looks like a bug. I think I failed to describe the problem clearly, so I will try again and summarize the previous conversation.

I have a situation where Pacemaker thinks a resource is running, but it is not. The agent, when run from the console, says it is not running.
I have no fencing, and the stop operation for this resource fails with a timeout.
You said that this is the reason for the situation. But I ran an experiment and found that if Pacemaker can't stop a resource, it marks it "unmanaged".

My resource was not "unmanaged": it was simply reported as running, and I had no indication of a problem.

We have already fixed the unstoppable scripts, but I want to be sure I will not run into this problem again.
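
For illustration only, a stop routine that cannot hang on a process ignoring SIGTERM might look like the sketch below. It is not the actual fix applied to these scripts; the function name stop_with_escalation and the grace period are hypothetical, and it assumes the process ID is already known (for example, from a pidfile).

```shell
#!/bin/sh
# Hypothetical helper, not taken from the real resource agent.
# Usage: stop_with_escalation PID [GRACE_SECONDS]
# Sends SIGTERM, waits up to GRACE_SECONDS, then escalates to SIGKILL.
# Returns 0 if the process is gone, 1 if it still exists afterwards.
stop_with_escalation() {
    pid=$1
    grace=${2:-10}

    # If the signal cannot be delivered, assume the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    i=0
    while [ "$i" -lt "$grace" ]; do
        # kill -0 only tests whether the process still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done

    # The grace period expired: SIGKILL cannot be caught or ignored.
    kill -KILL "$pid" 2>/dev/null
    sleep 1

    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

The grace period should stay well below the stop timeout configured in the cluster, so that the routine finishes before lrmd gives up on it.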

Below are some quotes from the previous conversation, if needed.

12.10.2011 6:11, Andrew Beekhof wrote:
On 10/03/2011 05:32 AM, Andrew Beekhof wrote:

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with "ver: 1"

2)
This one is scary.
Twice I ran into a situation where pacemaker thinks a resource is
started, but it is not.

RA is misbehaving.  Pacemaker will only consider a resource running if
the RA tells us it is (running or in a failed state).

But you can see below that the agent returns "7".
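
For reference, exit code 7 is OCF_NOT_RUNNING in the OCF resource agent convention. A small sketch for translating an agent's exit status into its standard name follows; the function name ocf_rc_name is hypothetical and is not part of any Pacemaker tool.

```shell
#!/bin/sh
# Hypothetical helper: map an OCF resource agent exit code to its standard name.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;
        7) echo "OCF_NOT_RUNNING" ;;
        8) echo "OCF_RUNNING_MASTER" ;;
        9) echo "OCF_FAILED_MASTER" ;;
        *) echo "unknown" ;;
    esac
}
```

After running the agent's monitor action by hand, the status can be translated with ocf_rc_name $?.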

It's still broken. Not one stop action succeeds.

Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 4082) timed out (try 1).  Killing with
signal SIGTERM (15).
Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 21859) timed out (try 1).  Killing
with signal SIGTERM (15).
Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 24576) timed out (try 1).  Killing
with signal SIGTERM (15).

/That/ is why pacemaker thinks it's still running.

I ran an experiment.

I created a script that does not die on SIGTERM:

#!/usr/bin/perl
$SIG{TERM} = "IGNORE"; sleep 1 while 1

And ran it under pacemaker.
I ran 3 tests:
1) primitive test-kill-15.pl ocf:mail.ru:generic \
        op monitor interval="20" timeout="5" on-fail="restart" \
        params binfile="/tmp/test-kill-15.pl" external_pidfile="1"

2) Same, but with on-fail="block"

3) Same, but with meatware stonith.

Each time I ran:
crm resource stop test-kill-15.pl

In cases 1 and 2, the resource became "unmanaged".

Because you've not configured any fencing devices.


--
Best regards,
Proskurin Kirill

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
