On 10/05/2013, at 12:26 AM, Brian J. Murrell <br...@interlinx.bc.ca> wrote:
> I do see the:
>
> May 7 02:37:32 node1 crmd[16836]: error: print_elem: Aborting transition, action lost: [Action 5]: In-flight (id: testfs-resource1_monitor_0, loc: node1, priority: 0)
>
> in the log. Is that the root cause of the problem?

Ordinarily I'd have said yes, but I also see:

May 7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing resource testfs-resource1 for 18002_crm_resource (internal) on node1
May 7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, flush delayed
May 7 02:36:16 node1 crmd[16836]: info: lrm_remove_deleted_op: Removing op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1

So apparently a badly timed cleanup was run. Did you do that or was it the crm shell?

> If so, what's that
> trying to tell me, exactly? If not, what is the cause of the problem?
>
> It really can't be the RA timing out since I give the monitor operation
> a 60 second timeout and the status action of the RA only take a few
> seconds at most to run and is not really an operation that can get
> blocked on anything. It's effectively the grepping of a file.

If the machine is heavily loaded, or just very busy with file I/O, that can still take quite a long time. I've seen IPaddr monitor actions take over a minute for example.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org