Dear All,

We are currently investigating a problem where a Libvirt/KVM VM in a Pacemaker cluster ends up running on two nodes after being migrated.

The problem is triggered when a constraint is inserted that causes the VM to be migrated away from its current node, but the constraint is then removed before the migrate_to action has finished [1]. Pacemaker then assumes that the VM is not running anywhere and restarts it on the original node, while Libvirt has already started it on the new node.
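
For illustration, a minimal crm shell sequence that reproduces the pattern by hand (resource and node names are made up; in our setup the constraint is inserted and removed automatically, see [1]):

    # ban vm01 from its current node -> Pacemaker starts migrating it away
    crm configure location ban-vm01 vm01 -inf: nodeA
    # remove the constraint again while migrate_to is still in flight
    crm configure delete ban-vm01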

This observation is from Pcmk 1.0, but I believe the same would happen with Pcmk 1.1 too, and there are possibly other circumstances under which this problem would pop up.

I think the root cause is that Pacemaker's assumptions about how a migration works and what Libvirt actually does are simply incompatible.

Pacemaker apparently assumes that "migrate_to" is more or less equivalent to a "stop" and "migrate_from" is basically the same as a "start", at least in the sense that the resource is assumed to be stopped after "migrate_to" and later started by "migrate_from". In between, Pacemaker assumes the resource to be stopped.
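
Schematically, the sequence Pacemaker schedules looks like this (a simplified sketch, not literal Pacemaker output):

    migrate_to   on the source node  # afterwards: resource assumed stopped
    migrate_from on the target node  # afterwards: resource assumed started
    # in between, the resource is assumed to be running nowhere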

But "Libvirt" (live-) migrations just does not work this way. Here, a migration is atomic, meaning the migration is one step after which the VM either runs on the new node or (on failure) is still running on the old node.

Our RA (like the VirtualDomain RA from "cluster-resource-agents") performs the Libvirt migration in the "migrate_to" step; "migrate_from" is pretty much a no-op. But this violates Pacemaker's assumption about the resource state after that step, because Pacemaker does not expect the resource to already be started on the new node (this is what caused the problem above).
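
So, schematically, our RA currently maps the actions like this:

    migrate_to   (source node): perform the complete Libvirt migration;
                                on success the VM already runs on the target
    migrate_from (target node): no-op, just report success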

Moving the Libvirt migration to "migrate_from", however, won't work either, because then "migrate_to" would have to be a no-op, which would violate Pacemaker's assumption that the VM has been stopped after "migrate_to" (we tried that, with very unpleasant results on failed migrations; and then there is the problem that "migrate_from" may never be called at all).

Any idea how to solve this?

Maybe implement a "migrate_atomic" action with the right semantics?
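
Purely hypothetically (no such action exists today), its semantics could be:

    migrate_atomic (runs on the source node):
        success -> resource is now running on the target node
        failure -> resource is still running on the source node
        # at no point is the resource assumed to run nowhere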

Ciao
  Andi


[1] Insertion and deletion of the constraint are in this case caused by a short DRBD hiccup that repairs itself.
