Hello Beekhof.
First of all, I don't want to waste your time, but this problem is really
important to me, I can't solve it by myself, and it looks like a
bug. I think I failed to describe the problem clearly, so I
will try again and summarize the previous conversation.
I have a situation where Pacemaker thinks a resource is running when
it is not. Running the agent from the console reports that it is not running.
I have no fencing, and this resource fails to stop because the stop action times out.
You said that this is the cause of the situation. But I ran an
experiment and found that when Pacemaker can't stop a resource, it marks it "unmanaged".
My resource was not "unmanaged": it was simply reported as running,
and I had no indication of a problem.
We have already fixed these unstoppable scripts, but I want to be sure that I
will not run into this problem again.
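To illustrate what I mean by a stop action that cannot hang, here is a minimal sketch in shell. It is not our real ocf:mail.ru:generic agent; the function name stop_with_escalation and the default 5-second grace period are made up for this example. The only OCF facts it relies on are that a stop of an already stopped resource must return 0 and that a failed stop returns a non-zero code.

```shell
#!/bin/sh
# Hypothetical helper, NOT taken from the real ocf:mail.ru:generic agent.
# Stops a process by PID: SIGTERM first, then SIGKILL if it is still alive
# after a grace period. Returns 0 once the process is gone, 1 otherwise.
stop_with_escalation() {
    pid="$1"
    grace="${2:-5}"   # seconds to wait after SIGTERM (assumed default)

    # Already gone: stopping a stopped resource must succeed.
    kill -0 "$pid" 2>/dev/null || return 0

    kill -TERM "$pid" 2>/dev/null
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done

    # The process ignored SIGTERM; SIGKILL cannot be caught or ignored.
    kill -KILL "$pid" 2>/dev/null
    sleep 1
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

The grace period has to be shorter than the stop timeout configured in Pacemaker, otherwise lrmd kills the stop action before it reaches the SIGKILL step.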
Below are some quotes from the previous conversation, if needed.
12.10.2011 6:11, Andrew Beekhof wrote:
On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with "ver: 1"
2)
This one is scary.
Twice I have run into a situation where Pacemaker thinks a resource is
started, but it is not.
RA is misbehaving. Pacemaker will only consider a resource running if
the RA tells us it is (running or in a failed state).
But you can see below that the agent returns "7".
It's still broken. Not one stop action succeeds.
Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 4082) timed out (try 1). Killing with
signal SIGTERM (15).
Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 21859) timed out (try 1). Killing
with signal SIGTERM (15).
Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
tranprocessor:stop process (PID 24576) timed out (try 1). Killing
with signal SIGTERM (15).
/That/ is why Pacemaker thinks it's still running.
I ran an experiment.
I created a script that doesn't die on SIGTERM:
#!/usr/bin/perl
$SIG{TERM} = "IGNORE"; sleep 1 while 1
Then I ran it under Pacemaker.
I ran 3 tests:
1) primitive test-kill-15.pl ocf:mail.ru:generic \
op monitor interval="20" timeout="5" on-fail="restart" \
params binfile="/tmp/test-kill-15.pl" external_pidfile="1"
2) Same, but with on-fail="block".
3) Same, but with meatware STONITH.
Each time I ran:
crm resource stop test-kill-15.pl
In cases 1 and 2, the resource ends up "unmanaged".
Because you've not configured any fencing devices.
--
Best regards,
Proskurin Kirill