Re: [Openstack-operators] [openstack-dev] [nova] Interesting bug when unshelving an instance in an AZ and the AZ is gone

Matt Riedemann Mon, 16 Oct 2017 09:36:06 -0700

On 10/16/2017 11:00 AM, Dean Troyer wrote:

[not having a dog in this hunt, this is what I would expect as a cloud consumer]

Thanks for the user perspective, that's what I'm looking for here, and operator perspective of course.


On Mon, Oct 16, 2017 at 10:22 AM, Matt Riedemann <[email protected]> wrote:

- The user creates an instance in a non-default AZ.
- They shelve offload the instance.
- The admin deletes the AZ that the instance was using, for whatever reason.
- The user unshelves the instance which goes back through scheduling and
fails with NoValidHost because the AZ on the original request spec no longer
exists.
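
To make the sequence above concrete, here is a minimal toy sketch in Python. It is not Nova code: the class and function names (RequestSpec, Instance, schedule, shelve_offload, unshelve) and the hosts_by_az mapping are simplified stand-ins assumed for illustration, and only the AZ lookup during scheduling is modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


class NoValidHost(Exception):
    """Toy stand-in for the scheduling failure described above."""


@dataclass
class RequestSpec:
    # AZ the user asked for at create time (None means "no preference").
    availability_zone: Optional[str]


@dataclass
class Instance:
    host: Optional[str]
    request_spec: RequestSpec


def schedule(spec: RequestSpec, hosts_by_az: Dict[str, List[str]]) -> str:
    """Pick a host, honoring the requested AZ if one was given."""
    if spec.availability_zone is None:
        candidates = [h for hosts in hosts_by_az.values() for h in hosts]
    else:
        candidates = hosts_by_az.get(spec.availability_zone, [])
    if not candidates:
        raise NoValidHost(f"no hosts for AZ {spec.availability_zone!r}")
    return candidates[0]


def shelve_offload(instance: Instance) -> None:
    # The guest is removed from the hypervisor, so the host is cleared.
    instance.host = None


def unshelve(instance: Instance, hosts_by_az: Dict[str, List[str]]) -> str:
    # Unshelve goes back through scheduling with the original request spec.
    instance.host = schedule(instance.request_spec, hosts_by_az)
    return instance.host


# Walk through the four steps from the list above.
hosts_by_az = {"az1": ["host1"], "az2": ["host2"]}
inst = Instance(host="host2", request_spec=RequestSpec("az2"))
shelve_offload(inst)
del hosts_by_az["az2"]  # the admin deletes the AZ
try:
    unshelve(inst, hosts_by_az)
    failed = False
except NoValidHost:
    failed = True
```

In this toy model the unshelve fails only because the request spec still names the deleted AZ; the instance itself no longer has a host.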

1. How reasonable is it for a user to expect in a stable production
environment that AZs are going to be deleted from under them? We actually
have a spec related to this but with AZ renames:


Change happens...

2. Should we null out the instance.availability_zone when it's shelved
offloaded like we do for the instance.host and instance.node attributes?
Similarly, we would not take into account the RequestSpec.availability_zone
when scheduling during unshelve. I tend to prefer this option because once
you shelve offload an instance, it's no longer associated with a host and
therefore no longer associated with an AZ. However, is it reasonable to
assume that the user doesn't care that the instance, once unshelved, is no
longer in the originally requested AZ? Probably not a safe assumption.


Agreed, unless we keep track that the user specified a default or no
AZ at create.

We do keep track of what the user originally requested, that is this RequestSpec object thing I keep referring to.


I think nulling the AZ when the original doesn't exist would be
reasonable from a user standpoint, but I'd feel handcuffed if that
happens and I can not select a new AZ. Or throwing a specific error
and letting the user handle it in #3 below:

At the point of failure, the API has done an RPC cast and returned a 202 to the user, so the only way to provide a message like this to the user would be to check if the original AZ still exists in the API. We could do that, it would just be something to be aware of.
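
A rough sketch of what such an API-side check could look like, in Python. Everything here is hypothetical: unshelve_request, AvailabilityZoneNotFound, existing_azs and the cast callable are names assumed for illustration, not Nova's actual API code.

```python
from typing import Callable, Optional, Set


class AvailabilityZoneNotFound(Exception):
    """Hypothetical error the API could return before doing the cast."""


def unshelve_request(
    requested_az: Optional[str],
    existing_azs: Set[str],
    cast: Callable[[], None],
) -> int:
    """Validate the originally requested AZ, then cast and return 202."""
    if requested_az is not None and requested_az not in existing_azs:
        # Fail synchronously, while the user is still waiting on the call.
        raise AvailabilityZoneNotFound(
            f"availability zone {requested_az!r} no longer exists"
        )
    cast()  # asynchronous from here on; later failures are not returned
    return 202
```

The thing to be aware of is that the check runs before the cast, so the AZ could still disappear between the check and scheduling; the sketch narrows the window, it does not close it.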

3. When a user unshelves, they can't propose a new AZ (and I don't think we
want to add that capability to the unshelve API). So if the original AZ is


Here is my question... if I can specify an AZ on create, why not on
unshelve?  Is it the image location movement under the hood?

I just don't think it's ever come up. The reason I hesitate to add the ability to the unshelve API is more or less rooted in my bias toward not liking shelve/unshelve in general because of how complicated and half-baked it is (we've had a lot of bugs from these APIs, some of which are still unresolved). That's not the user's fault though, so one could argue that if we're not going to deprecate these APIs, we need to make them more robust. We, as developers, also don't have any idea how many users are actually using the shelve API, so it's hard to know if we should spend any time on improving it.

gone, should we automatically remove the RequestSpec.availability_zone when
scheduling? I tend to not like this as it's very implicit and the user could
see the AZ on their instance change before and after unshelve and be
confused.
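
For illustration only, the implicit behavior in option 3 could be sketched like this in Python. az_for_unshelve and existing_azs are hypothetical names, not part of Nova.

```python
from typing import Optional, Set, Tuple


def az_for_unshelve(
    requested_az: Optional[str], existing_azs: Set[str]
) -> Tuple[Optional[str], bool]:
    """Return the AZ to schedule with and whether the request was dropped."""
    if requested_az is not None and requested_az not in existing_azs:
        # The original AZ is gone: schedule anywhere, silently.
        return None, True
    return requested_az, False
```

The second return value is the part the user never sees, which is why the AZ on the instance could change across unshelve without any explanation.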


Agreed that explicit is better than implicit.

4. We could simply do nothing about this specific bug and assert the
behavior is correct. The user requested an instance in a specific AZ,
shelved that instance and when they wanted to unshelve it, it's no longer
available so it fails. The user would have to delete the instance and create
a new instance from the shelve snapshot image in a new AZ. If we implemented


I do not have the list of things in my head that are preserved in
shelve/unshelve that would be lost in a recreate, but that's where my
worry would come.  Presumably that is why I shelved in the first place
rather than snapshotting the server and removing it.  Depends on the
cost models too, if I lose my grandfathered-in pricing by being forced
to recreate I may be unhappy.

The volumes and ports remain attached to the shelved instance, only the guest on the hypervisor is destroyed. It doesn't change anything about quota - you retain quota usage for a shelved instance so you have room in your quota to unshelve it later.

From what I can tell, the os-simple-tenant-usage API will still count the instance and its consumed disk/ram/cpu against you even though the guest is deleted from the hypervisor while the instance is shelved offloaded. So the operator is happy about shelved offloaded instances because that means they have more free capacity for new instances and moving things, but the user is still getting charged the same, if your billing model is based on os-simple-tenant-usage (which Telemetry uses I believe).

Sylvain's spec in #1 above, maybe we don't have this problem going forward
since you couldn't remove/delete an AZ when there are even shelved offloaded
instances still tied to it.


As a user I probably do not mind this, as an operator I'd likely be unhappy.

dt



--

Thanks,

Matt

_______________________________________________
OpenStack-operators mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
