Re: [Pacemaker] Best way to check if PM is alive

Evgeniy Ivanov Thu, 09 Dec 2010 04:02:16 -0800

On Thu, Dec 9, 2010 at 2:32 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> On Thu, Dec 9, 2010 at 12:14 PM, Evgeniy Ivanov <lolkaanti...@gmail.com> 
> wrote:
>> Hi,
>>
>> What is the best way to check if PM is still alive?
>
> "ps axf | grep crmd" is one approach


That only shows that the crmd process exists; it says nothing about
its state. In theory crmd could be stuck in some internal logic (an
endless loop, for example). So we need a way to ask "Hey, PM! Are
your brains still OK?".
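
One way to ask that question, as a minimal sketch: run a query that
crmd itself has to answer, and treat a missing reply within a time
limit as a hang. probe_alive is a hypothetical helper name, and it
relies on timeout(1) from GNU coreutils. Using "crmadmin -S" as the
query is an assumption; whether its exit status really reflects a hung
crmd on this Pacemaker version would need to be verified.

```shell
#!/bin/sh
# probe_alive TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: returns 0 only if COMMAND exits successfully
# before the time limit; returns non-zero on failure or on timeout.
probe_alive() {
    limit=$1
    shift
    # timeout(1) kills COMMAND when the limit expires and then
    # exits with a non-zero status (124 in GNU coreutils).
    timeout "$limit" "$@" >/dev/null 2>&1
}

# Assumed usage (not run here): ask crmd on the local node for its
# status and log a message if no answer arrives within 30 seconds.
#   probe_alive 30 crmadmin -S "$(uname -n)" || logger "crmd did not answer"
```

Unlike the ps check, this only passes if the queried daemon actually
produces a reply, not merely if its process exists.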

>> We tried the following approach: there is a softdog timer (max value
>> is 300s + an extra 60s to give PM another chance) that is initially
>> started and checked by a third party. A clone named HA_alive fails in
>> monitor (except the first time); the monitor interval is 200s.
>> HA_alive:start should reset that softdog timer. It looks like PM
>> sometimes doesn't restart the failed resource within those 360s for
>> no apparent reason: the system is almost idle.
>
> Strange.  Should work. Details?

It's a dual-node cluster based on openais-0.80.3-26.1 and
pacemaker-1.0.3-4.1. The solution I've described worked fine on my
cluster, but regularly failed for no apparent reason on some other
clusters. The logs (/var/log/messages) say that PM noticed the failure
in monitor, but it never restarted the HA_alive resource afterwards
(no stop and no start), so after 360s the system died. I didn't notice
anything else in the logs...
I will be able to share some /var/log/messages if I get access to the
failed clusters.
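
For reference, the HA_alive logic described above could be sketched
roughly as follows. This is not the actual agent: the function names,
the state files under /var/run/ha_alive and the use of return code 7
(OCF_NOT_RUNNING) for the failing monitor are assumptions, and the
third-party timer is modeled as a timestamp file rather than the real
softdog device.

```shell
#!/bin/sh
# Sketch only; every path and name below is hypothetical.
STATE_DIR=${STATE_DIR:-/var/run/ha_alive}
HEARTBEAT="$STATE_DIR/heartbeat"  # holds the epoch time of the last start
FRESH="$STATE_DIR/fresh"          # exists until the first monitor after start
DEADLINE=${DEADLINE:-360}         # 300s timer + 60s grace period

# start: reset the timer and allow exactly one successful monitor.
ha_alive_start() {
    mkdir -p "$STATE_DIR" || return 1
    date +%s > "$HEARTBEAT" && : > "$FRESH"
}

# monitor: succeed only the first time after a start, then report
# "not running" so that the cluster manager has to recover the resource.
ha_alive_monitor() {
    if [ -e "$FRESH" ]; then
        rm -f "$FRESH"
        return 0
    fi
    return 7
}

# stop: nothing to tear down except the one-shot marker.
ha_alive_stop() {
    rm -f "$FRESH"
    return 0
}

# Third-party check: returns 0 (expired) if the last start is older than
# DEADLINE seconds. A missing heartbeat file is treated as expired.
# The optional argument overrides "now" (epoch seconds).
ha_alive_expired() {
    now=${1:-$(date +%s)}
    last=$(cat "$HEARTBEAT" 2>/dev/null) || return 0
    [ $((now - last)) -gt "$DEADLINE" ]
}
```

With a 200s monitor interval, each failed monitor should lead to a
stop and start well inside the 360s window, which is why a missing
restart is fatal in this scheme.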


-- 
Evgeniy Ivanov

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
