On 31 January 2015 at 05:47, Daniel P. Berrange <berra...@redhat.com> wrote:
> In working on a recent Nova migration bug
>
>   https://bugs.launchpad.net/nova/+bug/1414065
>
> I had cause to refactor the way the nova libvirt driver monitors live
> migration completion/failure/progress. This refactor has opened the
> door for doing more intelligent active management of the live migration
> process.
...
> What kind of things would be the biggest win from Operators' or tenants'
> POV ?

Awesome. A couple of thoughts from my perspective.

Firstly, there's a bunch of situation-dependent tuning. One thing Crowbar does really nicely is that you specify the host layout in broad abstract terms, e.g. 'first 10G network link' and so on. Some of your settings above, like whether to compress pages, are going to be heavily dependent on the bandwidth available (I doubt that compression is a win on a 100G link, for instance, and I'd be suspicious of it even at 10G). So it would be nice if there were a single dial or two to set and Nova would auto-calculate good defaults from that (with appropriate overrides being available).

Operationally, avoiding trouble is better than being able to fix it, so I quite like the idea of defaulting the auto-converge option on, or perhaps making it controllable via flavours, so that operators can offer (and identify!) those particularly performance-sensitive workloads rather than having to guess which instances are special and which aren't.

Being able to cancel the migration would be good. Relatedly, being able to restart nova-compute while a migration is going on would be good (or, put differently, a migration happening shouldn't prevent a deploy of Nova code: interlocks like that make continuous deployment much harder).

If we can't already, I'd like as a user to be able to see that the migration is happening (that allows diagnosis of transient issues during the migration). Some ops folk may want to hide that, of course.

I'm not sure that automatically rolling back after N minutes makes sense: if the impact on the cluster is significant then 1 minute vs 10 doesn't intrinsically matter. What matters more is preventing too many concurrent migrations, so that would be another feature that I don't think we have yet: don't allow more than some N inbound and M outbound live migrations to a compute host at any time, to prevent IO storms. We may want to log with NOTIFICATION migrations that are still progressing but appear to be having trouble completing.
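
To make the N inbound / M outbound cap a bit more concrete, here is a rough sketch of the bookkeeping it implies. This is not existing Nova code: the class name `MigrationLimiter`, its methods, and the `max_inbound` / `max_outbound` parameters are all made up for illustration, and a real implementation would need to track this state across services rather than in one process.

```python
import threading
from collections import defaultdict


class MigrationLimiter:
    """Hypothetical per-host cap on concurrent live migrations (not Nova code)."""

    def __init__(self, max_inbound, max_outbound):
        self.max_inbound = max_inbound      # N: migrations arriving at one host
        self.max_outbound = max_outbound    # M: migrations leaving one host
        self._inbound = defaultdict(int)    # host name -> active inbound count
        self._outbound = defaultdict(int)   # host name -> active outbound count
        self._lock = threading.Lock()

    def try_start(self, source, dest):
        """Reserve a slot on both hosts; return False if either is at its cap."""
        with self._lock:
            if self._outbound[source] >= self.max_outbound:
                return False
            if self._inbound[dest] >= self.max_inbound:
                return False
            self._outbound[source] += 1
            self._inbound[dest] += 1
            return True

    def finish(self, source, dest):
        """Release the slots once a migration completes, fails or is cancelled."""
        with self._lock:
            self._outbound[source] -= 1
            self._inbound[dest] -= 1
```

A caller would check `try_start(source, dest)` before kicking off a migration, queue or reject the request when it returns False, and call `finish(source, dest)` on every exit path.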

And of course an admin API to query all migrations in progress to allow API-driven health checks by monitoring tools - which gives the power to manage things to admins without us having to write a probably-too-simple config interface.

HTH,
Rob

--
Robert Collins <rbtcoll...@hp.com>
Distinguished Technologist
HP Converged Cloud

_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators