On Thu, Dec 9, 2010 at 2:32 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> On Thu, Dec 9, 2010 at 12:14 PM, Evgeniy Ivanov <lolkaanti...@gmail.com> wrote:
>> Hi,
>>
>> What is the best way to check if PM is still alive?
>
> "ps axf | grep crmd" is one approach

That only tells us that crmd is alive; it says nothing about its state. In theory it could hang in some internal logic (something like an endless loop). So we need a way to ask "Hey, PM! Are your brains still OK?".

>> We tried the following approach: there is a softdog timer (max value is
>> 300s + an extra 60s to give PM another chance) that is initially started and
>> checked by a third party. A clone named HA_alive fails in monitor (except
>> the first time); the monitor interval is 200s. HA_alive:start should reset
>> that softdog timer. It looks like sometimes PM doesn't restart the failed
>> resource within those 360s for no reason: the system is almost idle.
>
> Strange. Should work. Details?

It's a dual-node cluster based on openais-0.80.3-26.1 and pacemaker-1.0.3-4.1. The solution I've described worked fine on my cluster, but regularly failed for no apparent reason on some other clusters. The logs (/var/log/messages) say that PM noticed the failure in monitor, but it never restarted the HA_alive resource afterwards (no stop and no start), so after 360s the system died. I didn't notice anything else in the logs. I will be able to share some /var/log/messages if I get access to the failed clusters.

--
Evgeniy Ivanov

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
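
For illustration only, the timing of the softdog scheme described in this message can be modelled in a few lines of Python. This is a rough sketch, not the actual setup: it assumes that the first monitor after a start passes, that the next monitor fails one full interval (200s) after the start, and that the whole stop/start recovery can be summarized as a single latency. The names WatchdogModel, recovery_slack and survives are hypothetical, and nothing here talks to a real softdog device or to Pacemaker.

```python
from dataclasses import dataclass

# Values taken from the message; everything else is an assumption.
BASE_TIMEOUT_S = 300      # softdog maximum value
EXTRA_GRACE_S = 60        # extra time "to give PM another chance"
MONITOR_INTERVAL_S = 200  # HA_alive monitor interval


@dataclass
class WatchdogModel:
    """Toy stand-in for the softdog timer (hypothetical, no device access)."""
    timeout_s: int
    last_reset_s: float = 0.0

    def reset(self, now_s: float) -> None:
        # Corresponds to HA_alive:start resetting the timer.
        self.last_reset_s = now_s

    def expired(self, now_s: float) -> bool:
        return now_s - self.last_reset_s >= self.timeout_s


def recovery_slack(timeout_s: int, monitor_interval_s: int) -> int:
    """Seconds left for stop + start once the failing monitor has fired."""
    return timeout_s - monitor_interval_s


def survives(restart_latency_s: float, cycles: int = 3) -> bool:
    """Return True if the timer never expires over the given number of cycles.

    Each cycle: wait one monitor interval (monitor fails), then wait
    restart_latency_s until the restarted resource resets the timer.
    """
    dog = WatchdogModel(BASE_TIMEOUT_S + EXTRA_GRACE_S)
    now = 0.0
    dog.reset(now)  # initial start of HA_alive
    for _ in range(cycles):
        now += MONITOR_INTERVAL_S  # assumed: monitor fails one interval after start
        now += restart_latency_s   # assumed: single combined stop/start delay
        if dog.expired(now):
            return False
        dog.reset(now)
    return True
```

Under these assumptions, recovery_slack(360, 200) leaves 160 seconds for the restart, so a node only dies if the restart is delayed by at least that long or never happens at all, which matches the symptom reported above (no stop and no start in the logs).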