On 08/22/2018 08:55 AM, Balázs Gibizer wrote:
On Fri, Aug 17, 2018 at 5:40 PM, Eric Fried <openst...@fried.cc> wrote:
gibi-
- On migration, when we transfer the allocations in either direction, a
conflict means someone managed to resize (or otherwise change
allocations?) since the last time we pulled data. Given the global lock
in the report client, this should have been tough to do. If it does
happen, I would think any retry would need to be done all the way back
at the claim, which I imagine is higher up than we should go. So again,
I think we should fail the migration and make the user retry.
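To make the "conflict" case above concrete, here is a minimal in-memory
sketch of a version-checked move. It is not Nova or placement code:
ToyAllocationStore, AllocationConflict, and the per-consumer version
counter are all hypothetical names, and the sketch assumes the conflict
is detected by comparing the version we last read against the stored one.

```python
class AllocationConflict(Exception):
    """Hypothetical error: allocations changed since we last read them."""


class ToyAllocationStore:
    """Toy stand-in for an allocation store; NOT the real placement API."""

    def __init__(self):
        self._data = {}  # consumer -> (version, resources)

    def get(self, consumer):
        # Unknown consumers start at version 0 with no resources.
        return self._data.get(consumer, (0, {}))

    def put(self, consumer, resources, seen_version):
        # Reject the write if someone else wrote after our last read.
        if seen_version != self.get(consumer)[0]:
            raise AllocationConflict(consumer)
        self._data[consumer] = (seen_version + 1, dict(resources))

    def move(self, source, target, seen):
        # 'seen' maps consumer -> (version, resources) as read earlier.
        # Check both consumers before writing, so a conflict leaves
        # the store untouched.
        for consumer in (source, target):
            if seen[consumer][0] != self.get(consumer)[0]:
                raise AllocationConflict(consumer)
        self._data[target] = (seen[target][0] + 1, dict(seen[source][1]))
        self._data[source] = (seen[source][0] + 1, {})


store = ToyAllocationStore()
store.put("instance", {"VCPU": 2}, 0)
seen = {c: store.get(c) for c in ("instance", "migration")}
store.put("instance", {"VCPU": 4}, 1)  # concurrent change after our read
try:
    store.move("instance", "migration", seen)
    conflicted = False
except AllocationConflict:
    conflicted = True  # stale view detected; the caller decides what next
```

In this toy the caller has to re-read before trying again, which mirrors
the point that a retry would have to start from fresh data.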
Do we want to fail the whole migration or just the migration step (e.g.
confirm, revert)?
The latter means that a failure during confirm or revert would put the
instance back to VERIFY_RESIZE, while the former would mean that in case
of a conflict at confirm we try an automatic revert. But for a conflict
at revert we can only put the instance into the ERROR state.
This again should be "impossible" to come across. What would the
behavior be if we hit, say, ValueError in this spot?
I might not totally follow you. I see two options to choose from for the
revert case:
a) An allocation manipulation error during revert of a migration causes
the instance to go to ERROR. -> The end user cannot retry the revert;
the instance needs to be deleted.
I would say this one is correct, but not because the user did anything
wrong. Rather, *something inside Nova failed* because technically Nova
shouldn't allow resource allocation to change while a server is in
CONFIRMING_RESIZE task state. If we didn't make the server go to an
ERROR state, I'm afraid we'd have no indication anywhere that this
improper situation ever happened and we'd end up hiding some serious
data corruption bugs.
b) An allocation manipulation error during revert of a migration causes
the instance to go back to the VERIFY_RESIZE state. -> The end user can
retry the revert via the API.
I see three options to choose from for the confirm case:
a) An allocation manipulation error during confirm of a migration causes
the instance to go to ERROR. -> The end user cannot retry the confirm;
the instance needs to be deleted.
For the same reasons outlined above, I think this is the only safe option.
Best,
-jay
b) An allocation manipulation error during confirm of a migration causes
the instance to go back to the VERIFY_RESIZE state. -> The end user can
retry the confirm via the API.
c) An allocation manipulation error during confirm of a migration causes
nova to automatically try to revert the migration. (For a failure
during this revert, the same options are available as for the generic
revert case; see above.)
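The three confirm options can be compared side by side with a small
sketch. This is only an illustration under stated assumptions, not
Nova's implementation: ConfirmPolicy, confirm_with_policy, the dict-based
instance, and the callbacks are all hypothetical, and the "status" key
simply mirrors the state names used in this thread.

```python
from enum import Enum


class AllocationConflict(Exception):
    """Hypothetical stand-in for an allocation manipulation error."""


class ConfirmPolicy(Enum):
    ERROR = "a"          # option a: instance goes to ERROR
    VERIFY_RESIZE = "b"  # option b: instance goes back to VERIFY_RESIZE
    AUTO_REVERT = "c"    # option c: try an automatic revert


def confirm_with_policy(instance, move_allocations, revert, policy):
    """Run the allocation move for a confirm and apply the chosen policy.

    'move_allocations' and 'revert' are caller-supplied callables
    (assumptions for this sketch). Returns True if the confirm succeeded.
    """
    try:
        move_allocations()
    except AllocationConflict:
        if policy is ConfirmPolicy.ERROR:
            instance["status"] = "ERROR"
        elif policy is ConfirmPolicy.VERIFY_RESIZE:
            instance["status"] = "VERIFY_RESIZE"
        else:
            # A failure inside this revert would need its own policy,
            # as in the generic revert case.
            revert(instance)
        return False
    instance["status"] = "ACTIVE"
    return True


def always_conflict():
    raise AllocationConflict("allocations changed")


def toy_revert(instance):
    instance["status"] = "REVERTED"  # placeholder marker for the sketch


results = {}
for policy in ConfirmPolicy:
    inst = {"status": "VERIFY_RESIZE"}
    confirm_with_policy(inst, always_conflict, toy_revert, policy)
    results[policy] = inst["status"]
```

With option a the failure stays visible on the instance, while options b
and c trade that visibility for a chance to retry or roll back.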
We also need to consider live migration. It is similar in the sense that
it also uses move_allocations, but it is different in that the end user
doesn't explicitly confirm or revert a live migration.
I'm looking for opinions about which option we should take in each case.
gibi
-efried
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe:
openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev