Dear All,

We are currently investigating a problem where a Libvirt/KVM VM in a Pacemaker cluster ends up running on two nodes after being migrated.

The problem is triggered when a constraint is inserted that causes the VM to be migrated away from its current node, but the constraint is then removed before the migrate_to action has finished [1]. Pacemaker then assumes that the VM is not running anywhere and restarts it on the original node, while Libvirt has already started it on the new node.
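
For illustration, a minimal crm shell sequence that reproduces the pattern by hand (resource and node names are made up; in our setup the constraint is inserted and removed automatically, see [1]):

    # ban vm01 from its current node -> Pacemaker starts migrating it away
    crm configure location ban-vm01 vm01 -inf: nodeA
    # remove the constraint again while migrate_to is still in flight
    crm configure delete ban-vm01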

This observation is from Pcmk 1.0, but I believe the same would happen with Pcmk 1.1 too, and there are possibly other circumstances under which this problem would pop up.

I think the root cause is that Pacemaker's assumptions about how a migration works and what Libvirt actually does are simply incompatible.

Pacemaker apparently assumes that "migrate_to" is more or less equivalent to a "stop" and "migrate_from" is basically the same as a "start", at least in the sense that the resource is assumed to be stopped after "migrate_to" and later started by "migrate_from". In between, Pacemaker assumes the resource to be stopped.
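
Schematically, the sequence Pacemaker schedules looks like this (a simplified sketch, not literal Pacemaker output):

    migrate_to   on the source node  # afterwards: resource assumed stopped
    migrate_from on the target node  # afterwards: resource assumed started
    # in between, the resource is assumed to be running nowhere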

But "Libvirt" (live-) migrations just does not work this way. Here, a migration is atomic, meaning the migration is one step after which the VM either runs on the new node or (on failure) is still running on the old node.

Our RA (like the VirtualDomain RA from "cluster-resource-agents") performs the Libvirt migration in the "migrate_to" step; "migrate_from" is pretty much a no-op. But this violates Pacemaker's assumption about the resource state after that step, because Pacemaker does not expect the resource to already be started on the new node (this is what caused the problem above).
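
So, schematically, our RA currently maps the actions like this:

    migrate_to   (source node): perform the complete Libvirt migration;
                                on success the VM already runs on the target
    migrate_from (target node): no-op, just report success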

Moving the Libvirt migration to "migrate_from", however, won't work either, because then "migrate_to" would have to be a no-op, which would violate Pacemaker's assumption that the VM has been stopped after "migrate_to" (we tried that, with very unpleasant results on failed migrations; and then there is the problem that "migrate_from" may never be called at all).

Any idea how to solve this?

Maybe implement a "migrate_atomic" action with the right semantics?
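
Purely hypothetically (no such action exists today), its semantics could be:

    migrate_atomic (runs on the source node):
        success -> resource is now running on the target node
        failure -> resource is still running on the source node
        # at no point is the resource assumed to run nowhere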

Ciao
  Andi


[1] Insertion and deletion of the constraint are in this case caused by a short DRBD hiccup that repairs itself.
