Hi everyone,

I am running a two-node cluster which hosts two Xen VMs. We're using DRBD, but it's managed directly from Xen.

The configuration of one of these resources is as follows:

primitive xen-vm1 ocf:heartbeat:Xen
        params xmfile="/etc/xen/vm1.cfg"
        op monitor interval="30s"
        op start interval="0" timeout="60s"
        op stop interval="0" timeout="300s"
        op migrate_from interval="0" timeout="240"
        op migrate_to interval="0" timeout="240"
        meta allow-migrate="true" target-role="Started"


I have a problem with the monitor operation. It seems to be working fine... until it doesn't. The cluster can be running for weeks without any failure, but sometimes the monitor operation fails with a really strange error from the resource agent. This is an excerpt of one of the failures:

Jan 28 14:40:20 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] (pid 11756)
Jan 28 14:40:20 xenhost1 lrmd: [3822]: info: operation monitor[71] on xen-vm1 for client 3825: pid 11756 exited with return code 0
Jan 28 15:40:26 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] (pid 18065)
Jan 28 15:40:27 xenhost1 lrmd: [3822]: info: operation monitor[71] on xen-vm1 for client 3825: pid 18065 exited with return code 0
Jan 28 16:40:32 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] (pid 24373)
Jan 28 16:40:32 xenhost1 lrmd: [3822]: info: operation monitor[71] on xen-vm1 for client 3825: pid 24373 exited with return code 0
Jan 28 17:40:38 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] (pid 30686)
Jan 28 17:40:38 xenhost1 lrmd: [3822]: info: operation monitor[71] on xen-vm1 for client 3825: pid 30686 exited with return code 0
Jan 28 18:40:44 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] (pid 4593)
Jan 28 18:40:44 xenhost1 lrmd: [3822]: info: operation monitor[71] on xen-vm1 for client 3825: pid 4593 exited with return code 0
Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output: (xen-vm1:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local:
Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output: (xen-vm1:monitor:stderr) en-list: bad variable name
Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output: (xen-vm1:monitor:stderr)
Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: cancel_op: operation monitor[71] on xen-vm1 for client 3825, its parameters: crm_feature_set=[3.0.6] xmfile=[/etc/xen/vm1.cfg] CRM_meta_name=[monitor] CRM_meta_interval=[30000] CRM_meta_timeout=[20000] cancelled
Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 stop[72] (pid 6219)

The machines are very low on resources, and this unnecessary migration is causing problems.

The systems are running Debian Wheezy with pacemaker 1.1.7-1 and resource-agents 3.9.2-5+deb7u1. I don't know yet if there's a problem with the Xen RA, the lrmd service itself or my configuration. I wasn't able to find any information related to this issue. Do you have any idea of what could be causing this? Any help will be appreciated.

Regards,
Santiago

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
