On 04/05/2016 05:33 PM, Daniel P. Berrange wrote:
On Tue, Apr 05, 2016 at 05:17:41PM +0200, Luis Tomas wrote:
Hi,
We are working on the possibility of including post-copy live migration into
Nova (https://review.openstack.org/#/c/301509/)
At the libvirt level, post-copy live migration works as follows:
- Start live migration with a post-copy enabler flag
(VIR_MIGRATE_POSTCOPY). Note this does not mean the migration is performed
in post-copy mode, just that you can switch it to post-copy at any given
time.
- Change the migration from pre-copy to post-copy mode.
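The two steps above could be sketched with the libvirt Python bindings
roughly as follows. This is a minimal sketch, not Nova's implementation: the
helper names start_migration and switch_to_postcopy are hypothetical, and the
libvirt module and the domain object are passed in as arguments so the flow
can be exercised without a hypervisor. It assumes bindings new enough to
expose VIR_MIGRATE_POSTCOPY and virDomain.migrateStartPostCopy().

```python
def start_migration(libvirt_mod, dom, dest_uri, allow_postcopy=True):
    """Start a live migration, optionally permitting a later post-copy switch.

    libvirt_mod is the imported libvirt module and dom a virDomain object.
    The underlying call blocks until the migration ends, so a real caller
    would run it in its own thread.
    """
    flags = libvirt_mod.VIR_MIGRATE_LIVE | libvirt_mod.VIR_MIGRATE_PEER2PEER
    if allow_postcopy:
        # This only permits a later switch; the migration starts in pre-copy.
        flags |= libvirt_mod.VIR_MIGRATE_POSTCOPY
    # An empty params dict leaves every typed parameter at its default.
    return dom.migrateToURI3(dest_uri, {}, flags)


def switch_to_postcopy(dom):
    """Switch a running pre-copy migration of dom into post-copy mode."""
    # No flags are passed here; 0 is the bindings' default value.
    return dom.migrateStartPostCopy(0)
```

switch_to_postcopy() would be called from a different thread than the one
blocked in start_migration(), while the migration job is still active.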
However, we are not sure what the most convenient way is to provide this
functionality at the Nova level.
The current spec proposes an optional flag in the live migration API that
adds the VIR_MIGRATE_POSTCOPY flag when starting the live migration. We then
propose a second API to actually switch the migration from pre-copy to
post-copy mode, similar to how it is done in libvirt. This is also similar
to how the new "force-migrate" option works to ensure migration completion.
In fact, this method could be an extension of force-migrate: switching to
post-copy if the migration was started with the VIR_MIGRATE_POSTCOPY libvirt
flag, or pausing the VM otherwise.
The downside of this approach is that we expose too specific a mechanism
through the API. To alleviate this, we could remove the "switch" API and
automate the switch based on data transferred, available bandwidth or
other related metrics. However, we would still need the extension to the
live-migration API to include the proper libvirt post-copy flag.
No, we absolutely don't want to expose that in the API as a concept, as it
is a private technical implementation detail of the KVM migration code.
I see the point and agree on trying not to expose this as an API,
especially the switch. In fact, as part of the ORBIT EU FP7 project we
implemented post-copy for OpenStack Juno, where the switch to post-copy was
automatically triggered after the first iteration of memory copying.
On the other hand, I still see the point of including a flag to decide
the type of migration on a per-VM basis. Note that, even though what we have
available right now is the QEMU/libvirt implementation of post-copy,
post-copy in itself is a live migration type (where the migration process
is driven by the destination VM instead of the source VM), regardless of
how it is implemented underneath. This is unlike compression, auto-converge
and max-downtime, which are extra settings of these types of migration.
The other solution is to start all migrations with the
VIR_MIGRATE_POSTCOPY flag, so that no new APIs would be needed. The
system could automatically detect that the migration is taking too long (or
that the guest is dirtying memory faster than the sending rate), and
automatically switch to post-copy.
Yes, this is what we should be doing as the default behaviour with new
enough QEMU, IMHO.
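The automatic switch discussed above could be driven by a check along these
lines. It is only a sketch under stated assumptions: the function name, its
parameters and the 300-second default are hypothetical, and how the elapsed
time and the two rates are sampled from the running migration is left out.

```python
def should_switch_to_postcopy(elapsed_s, dirty_rate_bps, transfer_rate_bps,
                              max_precopy_s=300.0):
    """Return True when pre-copy looks unlikely to converge on its own.

    elapsed_s          seconds since the migration started
    dirty_rate_bps     rate at which the guest dirties memory (bytes/s)
    transfer_rate_bps  rate at which memory is being sent (bytes/s)
    max_precopy_s      hypothetical time budget for the pre-copy phase
    """
    # The migration has been running for too long.
    if elapsed_s >= max_precopy_s:
        return True
    # The guest dirties memory faster than it can be sent, so pre-copy
    # cannot catch up.
    return dirty_rate_bps > transfer_rate_bps
```

A monitoring loop would call this periodically and trigger the switch the
first time it returns True.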
The downside of this is that including the VIR_MIGRATE_POSTCOPY flag has an
overhead, and it would not be desirable to include it for all migrations,
especially if they can be migrated fine with pre-copy mode. In addition, if
the migration fails after the switch, the VM will be lost. Therefore,
admins may want to ensure that post-copy is not used for some specific VMs.
We shouldn't be trying to run before we can walk. Even if post-copy
hurts some guests, it'll still be a net win overall because it will
give a guarantee that migration can complete without needing to stop
guest CPUs entirely. All we need to start with is a nova.conf setting
to let the admin turn off use of post-copy for the host, for cases where
we want to prioritize performance over the ability to migrate successfully.
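As an illustration of such a per-host switch, the sketch below reads a
boolean from an INI-style file. The option name permit_post_copy and its
placement in a [libvirt] section are assumptions made for this example, and
the standard-library configparser is used only to keep it self-contained
(Nova itself reads its configuration through oslo.config).

```python
import configparser

# Sample nova.conf fragment. The option name is hypothetical.
SAMPLE_CONF = """
[libvirt]
permit_post_copy = false
"""


def post_copy_permitted(conf_text, default=True):
    """Return whether post-copy may be used on this host.

    Falls back to `default` when the section or the option is missing, so
    post-copy stays enabled unless the admin turns it off explicitly.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getboolean("libvirt", "permit_post_copy", fallback=default)
```

With the sample fragment above, post_copy_permitted(SAMPLE_CONF) reports that
post-copy is disabled for the host.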
My concern here is that it is not only performance, but also reliability
as post-copy migrations cannot be recovered in case of a failure during
the migration process.
Any plan wrt changing migration behaviour on a per-VM basis needs to
consider a much broader set of features than just post-copy. For example,
compression, autoconverge and max-downtime settings all have an overhead
or impact on the guest too. We don't want to end up exposing API flags to
turn any of these on/off individually. So any solution to this will have
to look at a combination of usage context and some kind of SLA marker on
the guest. E.g., if the migration is in the context of host-evacuate, which
absolutely must always complete in finite time, we should always use
post-copy. If the migration is in the context of load-balancing workloads
across hosts, then some aspect of guest SLA must inform whether Nova chooses
to use post-copy, or compression or auto-converge, etc.
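A policy of this shape could be sketched as below. Everything named here is
hypothetical: the Context enum, the SLA markers ("gold", "silver", "bronze")
and the mapping from marker to feature are arbitrary placeholders that only
show the lookup, not a proposal for the actual values.

```python
from enum import Enum


class Context(Enum):
    HOST_EVACUATE = "host-evacuate"
    LOAD_BALANCE = "load-balance"


# Hypothetical SLA markers on the guest. The mapping is arbitrary and only
# illustrates how a marker could select a migration feature.
SLA_TO_FEATURE = {
    "gold": "auto-converge",
    "silver": "compression",
    "bronze": "post-copy",
}


def choose_feature(context, sla_marker=None):
    """Pick a migration feature from the usage context and the guest SLA.

    Returns None when no extra feature applies (plain pre-copy).
    """
    # Evacuation must complete in finite time, so always use post-copy.
    if context is Context.HOST_EVACUATE:
        return "post-copy"
    # For any other context, the guest SLA marker decides.
    return SLA_TO_FEATURE.get(sla_marker)
```

The point of the design is that the caller states why it is migrating and
the guest carries an SLA marker; no API flag toggles an individual feature.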
Regards,
Daniel
Thanks for the valuable input and discussion!
Best regards,
Luis
--
-----------------------------------
Dr. Luis Tomás
Postdoctoral Researcher
Department of Computing Science
Umeå University
l...@cs.umu.se
www.cloudresearch.se
www8.cs.umu.se/~luis
------------------------------------
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)