Re: [Pacemaker] resource starts but then fails right away

Andrew Beekhof Thu, 09 May 2013 19:03:49 -0700

On 10/05/2013, at 12:26 AM, Brian J. Murrell <br...@interlinx.bc.ca> wrote:


> I do see the:
> 
> May  7 02:37:32 node1 crmd[16836]:    error: print_elem: Aborting transition, action lost: [Action 5]: In-flight (id: testfs-resource1_monitor_0, loc: node1, priority: 0)
> 
> in the log.  Is that the root cause of the problem?  

Ordinarily I'd have said yes, but I also see:

May  7 02:36:16 node1 crmd[16836]:     info: delete_resource: Removing resource testfs-resource1 for 18002_crm_resource (internal) on node1
May  7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, flush delayed
May  7 02:36:16 node1 crmd[16836]:     info: lrm_remove_deleted_op: Removing op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1

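So apparently a cleanup was run at a bad moment: the resource was deleted
from the lrmd while its initial probe (testfs-resource1_monitor_0) was still
running, so the result of that probe never made it back to the crmd.  That
would explain the "action lost" abort you saw a little over a minute later.
Did you run the cleanup yourself, or was it the crm shell?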
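
If you want to check other logs for the same pattern, here is a rough Python
sketch.  It assumes each log entry has been joined back onto a single line, and
the regular expressions are based only on the three messages quoted in this
thread, so other Pacemaker versions may word them differently.  The function
name find_cleanup_races is made up for this example.

```python
import re

# Patterns derived from the log lines quoted in this thread (an assumption:
# the wording may differ between Pacemaker versions).
DELETE_RE = re.compile(r"delete_resource: Removing resource (\S+) for (\S+)")
REMOVED_OP_RE = re.compile(
    r"lrm_remove_deleted_op: Removing op (\S+?):(\d+) for deleted resource (\S+)"
)
LOST_RE = re.compile(r"action lost: \[Action \d+\]: In-flight \(id: ([^,]+),")


def find_cleanup_races(lines):
    """Return (op_id, client) pairs for ops removed by a cleanup and later lost."""
    client_for = {}  # resource name -> client that requested the delete
    removed = {}     # op id -> client whose delete removed it
    races = []
    for line in lines:
        m = DELETE_RE.search(line)
        if m:
            client_for[m.group(1)] = m.group(2)
            continue
        m = REMOVED_OP_RE.search(line)
        if m:
            removed[m.group(1)] = client_for.get(m.group(3), "unknown")
            continue
        m = LOST_RE.search(line)
        if m and m.group(1) in removed:
            races.append((m.group(1), removed[m.group(1)]))
    return races
```

Feed it the lines of the log file in order; any pair it returns is an
operation that was removed by a resource delete and afterwards reported as
lost, together with the client that asked for the delete.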

> If so, what's that
> trying to tell me, exactly?  If not, what is the cause of the problem?
> 
> It really can't be the RA timing out since I give the monitor operation
> a 60 second timeout and the status action of the RA only takes a few
> seconds at most to run and is not really an operation that can get
> blocked on anything.  It's effectively the grepping of a file.

If the machine is heavily loaded, or just very busy with file I/O, even that
can still take quite a long time.
I've seen IPaddr monitor actions take over a minute, for example.
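
One way to test that theory is to time the status action by hand while the
machine is under its normal load.  A minimal sketch, assuming you already have
some command line that runs the RA's status action outside the cluster (the
argv you pass in is up to you; nothing here is Pacemaker-specific, and the
60-second default just mirrors the timeout you mentioned):

```python
import subprocess
import time


def time_command(argv, runs=5):
    """Run argv `runs` times and return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.monotonic() - start)
    return durations


def exceeds_timeout(durations, timeout_s=60.0):
    """Return True if any run took longer than the operation timeout."""
    return max(durations) > timeout_s
```

If the slowest run gets anywhere near the timeout while the node is busy, the
RA timing out becomes a lot more plausible.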


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
