Hi, I have a pacemaker setup using the Xen resource agent and I've found something weird during migration: if a VM is in the middle of live-migrating from node 1 to node 2, and I stop the resource in crm, pacemaker forgets about the migration and immediately thinks the resource is stopped, although it doesn't actually call the stop action. Meanwhile, the migration continues and the VM ends up running on node 2.
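For anyone trying to reproduce this, one way to see where the domain really is (as opposed to where the cluster thinks it is) is to look at the output of "xm list" on each node. Below is a minimal sketch of such a check, not part of the Xen resource agent. The helper name domain_listed is made up for illustration, and it assumes that xend shows the source domain as "migrating-<name>" while a live migration is in flight and that the first column of the listing is the domain name.

```shell
#!/bin/sh
# Sketch: succeed if a Xen domain appears in "xm list"-style output on stdin.
# domain_listed is a hypothetical helper, not an existing tool.
# Assumptions: the first line is a header, the first column is the domain
# name, and a domain being live-migrated away is listed as "migrating-<name>".
domain_listed() {
    awk -v name="$1" '
        NR > 1 && ($1 == name || $1 == ("migrating-" name)) { found = 1 }
        END { exit !found }
    '
}

# Possible use on each node before bringing it out of standby:
#   xm list | domain_listed vm-test2 && echo "vm-test2 is still active here"
```

If the check succeeds on a node where the cluster reports the resource as stopped, the domain has been left behind by the forgotten migration.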
This can cause problems: let's say you put both nodes into standby one after the other. The cluster starts migrating a VM from node 1 to node 2, then thinks it stops the resource when node 2 goes to standby, but the migration continues and the VM is left running on node 2. Later when the nodes are brought out of standby, the cluster starts the VM on node 1 and hoses the filesystem.

Is there a way around this? I'm not sure there is a clean way to abort a Xen live migration, but even if there were, the cluster isn't calling any actions so there'd be no way to trigger the abort.

I've tried with op_defaults record-pending="false" and "true", and with and without the monitor op on the Xen resource.

Here's part of the log from a run with record-pending="false" and the following Xen primitive:

primitive vm-test2 ocf:heartbeat:Xen \
        meta allow-migrate="true" target-role="Started" \
        op monitor interval="10" \
        params xmfile="/etc/xen/vm/vm-test2"

Aug 26 15:55:49 xen-test1 pengine: [5147]: info: complex_migrate_reload: Migrating vm-test2 from xen-test1 to xen-test2
Aug 26 15:55:49 xen-test1 pengine: [5147]: notice: LogActions: Migrate resource vm-test2 (Started xen-test1 -> xen-test2)
Aug 26 15:55:52 xen-test1 pengine: [5147]: info: complex_migrate_reload: Migrating vm-test2 from xen-test1 to xen-test2
Aug 26 15:55:52 xen-test1 pengine: [5147]: notice: LogActions: Migrate resource vm-test2 (Started xen-test1 -> xen-test2)
Aug 26 15:55:58 xen-test1 lrmd: [5145]: info: rsc:vm-test2:40: migrate_to
Aug 26 15:55:58 xen-test1 crmd: [5148]: info: te_rsc_command: Initiating action 27: migrate_to vm-test2_migrate_to_0 on xen-test1 (local)
Aug 26 15:55:58 xen-test1 crmd: [5148]: info: process_lrm_event: LRM operation vm-test2_monitor_10000 (call=39, status=1, cib-update=0, confirmed=true) Cancelled
Aug 26 15:55:58 xen-test1 Xen[17077]: [17109]: INFO: vm-test2: Starting xm migrate to xen-test2
# "crm resource stop vm-test2" was run at this point
Aug 26 15:56:07 xen-test1 crmd: [5148]: info: abort_transition_graph: need_abort:59 - Triggered transition abort (complete=0) : Non-status change
Aug 26 15:56:07 xen-test1 cib: [5144]: info: log_data_element: cib:diff: + <nvpair id="vm-test2-meta_attributes-target-role" name="target-role" value="Stopped" __crm_diff_marker__="added:top" />
Aug 26 15:56:49 xen-test1 Xen[17077]: [17504]: INFO: vm-test2: xm migrate to xen-test2 succeeded.

cluster-glue-1.0.5-0.5.1
corosync-1.2.1-0.5.1
pacemaker-1.1.2-0.2.1
resource-agents-1.0.3-0.3.2

Thanks,
Mike

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker