On 10/5/2018 6:59 PM, melanie witt wrote:
5) when live migration fails due to an internal error rollback is not
handled correctly https://bugs.launchpad.net/nova/+bug/1788014
- Bug was reported on 2018-08-20
- The change that caused the regression landed on 2018-07-26, FF day
https://review.openstack.org/434870
- Unrelated to a blueprint, the regression was part of a bug fix
- Was found because sean-k-mooney was doing live migrations and found
that when a LM failed because of a QEMU internal error, the VM remained
ACTIVE but the VM no longer had network connectivity.
- Question: why wasn't this caught earlier?
- Answer: We would need a live migration job scenario that intentionally
initiates and fails a live migration, then verifies network connectivity
after the rollback occurs.
- Question: can we add something like that?
Not in Tempest, no, but we could run something in the
nova-live-migration job since that executes via its own script. We could
hack something in like what we have proposed for testing evacuate:
https://review.openstack.org/#/c/602174/
The trick is figuring out how to introduce a fault in the destination
host without taking down the service, because if the compute service is
down we won't schedule to it.
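
To make the post-rollback check concrete, here is a rough sketch of what
the verification half could look like. It does not solve the fault
injection problem described above; it only covers what to assert once a
live migration has failed. The helper names (run_cli, verify_rollback,
RollbackCheck) are made up for illustration and are not part of any
existing job. It assumes admin credentials (needed to see
OS-EXT-SRV-ATTR:host) and a guest IP that is reachable from wherever the
script runs.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List

# A runner takes an argv list and returns stdout. It is expected to raise
# subprocess.CalledProcessError when the command exits non-zero.
Runner = Callable[[List[str]], str]


def run_cli(cmd: List[str]) -> str:
    """Default runner: actually execute the command."""
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout.strip()


@dataclass
class RollbackCheck:
    status_ok: bool  # server is still ACTIVE
    host_ok: bool    # server is still on the source host
    ping_ok: bool    # guest still answers ping

    @property
    def passed(self) -> bool:
        return self.status_ok and self.host_ok and self.ping_ok


def server_field(server: str, column: str, run: Runner) -> str:
    """Read a single column from 'openstack server show'."""
    return run(["openstack", "server", "show", server,
                "-f", "value", "-c", column])


def can_ping(ip: str, run: Runner) -> bool:
    """Return True if three pings to the guest succeed."""
    try:
        run(["ping", "-c", "3", ip])
        return True
    except subprocess.CalledProcessError:
        return False


def verify_rollback(server: str, source_host: str, ip: str,
                    run: Runner = run_cli) -> RollbackCheck:
    """Check the state of a server after a failed live migration."""
    return RollbackCheck(
        status_ok=server_field(server, "status", run) == "ACTIVE",
        host_ok=server_field(server, "OS-EXT-SRV-ATTR:host", run) == source_host,
        ping_ok=can_ping(ip, run),
    )
```

With bug 1788014 present, the expectation would be status_ok and host_ok
both True but ping_ok False, so checking only the server status is not
enough to catch the regression.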
6) nova-manage db online_data_migrations hangs on instances with no host
set https://bugs.launchpad.net/nova/+bug/1788115
- Bug was reported on 2018-08-21
- The patch that introduced the bug landed on 2018-05-30
https://review.openstack.org/567878
- Unrelated to a blueprint, the regression was part of a bug fix
- Question: why wasn't this caught earlier?
- Answer: To hit the bug, you had to have instances with no host set
(ones that failed to schedule) in your database during an upgrade. This
does not happen during the grenade job.
- Question: could we add anything to the grenade job that would leave
some instances with no host set to cover cases like this?
Probably - I'd think creating a server on the old side with some
parameters that we know won't schedule would do it, maybe requesting an
AZ that doesn't exist, or some other kind of scheduler hint that we know
won't work so we get a NoValidHost. However, the online_data_migrations
run in grenade probably doesn't cover the cell0 database, which is where
instances that fail scheduling end up, so I'm not sure we would have
caught that case.
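
As a sketch of the "server that won't schedule" idea, something like the
following could build the create command and confirm the server really
ended up with no host. The function names and the "no-such-az" default
are invented for illustration; the flavor, image, and network ID would
have to come from whatever grenade already sets up. It also assumes admin
credentials, and that an unset host is printed as either an empty string
or "None" by the value formatter.

```python
import subprocess
from typing import Callable, List

# A runner takes an argv list and returns stdout.
Runner = Callable[[List[str]], str]


def run_cli(cmd: List[str]) -> str:
    """Default runner: actually execute the command."""
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def unschedulable_server_cmd(name: str, flavor: str, image: str,
                             net_id: str,
                             bogus_az: str = "no-such-az") -> List[str]:
    """Build a create command that requests an AZ that does not exist."""
    return [
        "openstack", "server", "create",
        "--flavor", flavor,
        "--image", image,
        "--nic", f"net-id={net_id}",
        "--availability-zone", bogus_az,
        name,
    ]


def is_unscheduled(name: str, run: Runner = run_cli) -> bool:
    """True if the server is in ERROR and was never assigned a host."""
    show = ["openstack", "server", "show", name, "-f", "value", "-c"]
    status = run(show + ["status"])
    host = run(show + ["OS-EXT-SRV-ATTR:host"])
    # Assumption: an unset host shows up as "" or the literal "None".
    return status == "ERROR" and host in ("", "None")
```

Server creation is asynchronous, so a real script would need to poll
until the status leaves BUILD before calling is_unscheduled().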
--
Thanks,
Matt
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)