Daniel,

Thanks.

We will need to do some work to recreate the instance performance and disk i/o issues and investigate further.

My original message did not go out to the mailing list due to a subscription issue, so I'm including it here.


I'm just starting work on Nova upstream, having been focused on live
migration orchestration in our large Public Cloud environment.  We were
trying to use live migration to do rolling reboots of compute nodes in order to apply software patches that required node or virtual machine restarts. For this sort of activity to work at large scale, the orchestration needs to be highly automated and integrate with the operator's monitoring and issue tracking systems. It also needs the mechanism used to move instances to be highly robust.

However, the most significant impediment we encountered was customer
complaints about the performance of instances during migration. We did a little work to identify the cause of this and concluded that the main issue was disk i/o contention. I wonder if this is something you or others have encountered? I'd be interested in any ideas for managing the rate of the migration processing to prevent it from adversely impacting customer application performance. I appreciate that if we throttle the migration processing it will take longer, and may not be able to keep up with the rate of disk/memory change in the instance.

Could you point me at somewhere I can get details of the tunable settings relating to cutover downtime please? I'm assuming these are libvirt/qemu settings? I'd like to play with them in our test environment to see if we can simulate busy instances and determine what works. I'd also be happy to do some work to expose these in nova so the cloud operator can tweak them if necessary.
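
From a quick look at the libvirt API docs, I think the call I'd be experimenting with at that level is something like the following (a rough sketch using python-libvirt; the domain name and value are just placeholders):

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000001')  # placeholder name
    # Ask qemu to only switch over once it believes the remaining dirty
    # memory can be transferred within 500ms of guest pause time.
    dom.migrateSetMaxDowntime(500, 0)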

I understand that you have added some functionality to the nova compute
manager to collect data on migration progress and emit this to the log file.

I'd like to propose that we extend this to emit notification messages
containing progress information, so a cloud operator's orchestration can
consume these events and use them to monitor the progress of individual
migrations. This information could be used to generate alerts or tickets so that support staff can intervene. The smarts in qemu to help it make progress are very welcome and necessary, but in my experience the cloud operator needs to be able to manage these, and if it is necessary to slow down or even pause a customer's instance to complete the migration, the cloud operator may need to gain customer consent before proceeding.
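
To make the shape concrete, here is a hypothetical sketch (not existing nova code; the event name and payload fields are invented for illustration) of what the compute manager might emit via oslo.messaging:

    from oslo_config import cfg
    import oslo_messaging as messaging

    transport = messaging.get_notification_transport(cfg.CONF)
    notifier = messaging.Notifier(transport, publisher_id='compute.host1',
                                  driver='messagingv2')

    def notify_progress(context, instance_uuid, info):
        # 'info' would be the libvirt job stats the compute manager
        # already polls when it writes its periodic log message.
        payload = {
            'instance_uuid': instance_uuid,
            'memory_total': info.memory_total,          # bytes of guest RAM
            'memory_remaining': info.memory_remaining,  # bytes still to send
            'disk_total': info.disk_total,
            'disk_remaining': info.disk_remaining,
        }
        notifier.info(context, 'compute.instance.live_migration.progress',
                      payload)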

I am also considering submitting a proposal to build on the current spec for monitoring and cancelling migrations, to make the migration status information available to users (based on a policy setting) and to include an estimated time to completion in the response. I appreciate that this would only be an estimate, but it may give the user some idea of how long they will need to wait before they can perform operations on their instance that are not permitted during migration. To cater for the scenario where a customer urgently needs to perform an inhibited operation (like attaching or detaching a volume), I would propose that we allow a user to cancel the migration of their own instances. This would be enabled for authorized users by granting them a specific role.
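
For example, the policy entries might look something like this (hypothetical policy keys and role name, just to illustrate the idea):

    "os_compute_api:server-migrations:index": "rule:admin_or_owner",
    "os_compute_api:server-migrations:cancel": "role:migration_canceller"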

More thoughts Monday!




-----Original Message-----
From: Daniel P. Berrange [mailto:[email protected]]
Sent: 21 September 2015 09:56
To: Carlton, Paul (Cloud Services)
Cc: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [nova] live migration in Mitaka

On Fri, Sep 18, 2015 at 05:47:31PM +0000, Carlton, Paul (Cloud Services) wrote:
> However the most significant impediment we encountered was customer
> complaints about performance of instances during migration.  We did a
> little bit of work to identify the cause of this and concluded that
> the main issue was disk i/o contention.  I wonder if this is
> something you or others have encountered?  I'd be interested in any
> ideas for managing the rate of the migration processing to prevent it
> from adversely impacting the customer application performance.  I
> appreciate that if we throttle the migration processing it will take
> longer and may not be able to keep up with the rate of disk/memory change in
> the instance.

I would not expect live migration to have an impact on disk I/O, unless your storage is network based and using the same network as the migration data. While migration is taking place you'll see a small impact on the guest compute performance, due to page table dirty bitmap tracking, but that shouldn't appear directly as a disk I/O problem. There is no throttling of guest I/O at all during migration.
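
If your migration traffic does share a network with your storage, one thing you could experiment with is capping the migration bandwidth. A rough sketch via python-libvirt on the source host (domain name and limit are placeholders):

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000001')  # placeholder name
    # Cap the migration stream for this domain; the value is in MiB/s
    # and applies to an in-progress or subsequently started migration.
    dom.migrateSetMaxSpeed(100, 0)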

> Could you point me at somewhere I can get details of the tunable
> settings relating to cutover downtime please?  I'm assuming that
> these are libvirt/qemu settings?  I'd like to play with them in our
> test environment to see if we can simulate busy instances and
> determine what works.  I'd also be happy to do some work to expose
> these in nova so the cloud operator can tweak them if necessary?

It is already exposed as 'live_migration_downtime', along with 'live_migration_downtime_steps' and 'live_migration_downtime_delay'. Again, it shouldn't have any impact on guest performance while live migration is taking place. It only comes into effect when checking whether the guest is ready to switch to the new host.
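
For example, in nova.conf (IIRC these live in the [libvirt] group; the values below are roughly the defaults):

    [libvirt]
    # Maximum permitted downtime at switchover, in milliseconds
    live_migration_downtime = 500
    # Number of incremental steps used to reach that maximum
    live_migration_downtime_steps = 10
    # Delay between step increases, in seconds per GiB of guest RAM
    live_migration_downtime_delay = 75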

> I understand that you have added some functionality to the nova
> compute manager to collect data on migration progress and emit this to the
> log file.
> I'd like to propose that we extend this to emit notification messages
> containing progress information so a cloud operator's orchestration
> can consume these events and use them to monitor progress of
> individual migrations.  This information could be used to generate
> alerts or tickets so that support staff can intervene.  The smarts in
> qemu to help it make progress are very welcome and necessary but in my
> experience the cloud operator needs to be able to manage these and if
> it is necessary to slow down or even pause a customer's instance to
> complete the migration the cloud operator may need to gain customer consent
> before proceeding.

We already update the Nova instance object's 'progress' value with the info on the migration progress. IIRC, this is visible via 'nova show <instance>'
or something like that.
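
e.g. something like:

    $ nova show <instance-uuid> | grep progress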

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
