Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

Jay Pipes Thu, 31 May 2018 12:36:55 -0700

On 05/31/2018 01:09 PM, Dan Smith wrote:

My feeling is that we should not attempt to "migrate" any allocations
or inventories between root or child providers within a compute node,
period.


While I agree this is the simplest approach, it does put a lot of
responsibility on the operators to do work to sidestep this issue, which
might not even apply to them (and knowing if it does might be
difficult).

Perhaps, yes. Though the process I described is certainly not foreign to
operators. It is a safe and well-practiced approach.

The virt drivers should simply error out of update_provider_tree() if
there are ANY existing VMs on the host AND the virt driver wishes to
begin tracking resources with nested providers.
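A minimal sketch of that guard, for illustration only. The exception class,
the helper name and its arguments are all hypothetical; none of this is
existing Nova code, and a real driver would make the check from inside its
own update_provider_tree().

```python
class NestedProviderUpgradeError(Exception):
    """Hypothetical error: the host must be emptied before going nested."""


def check_can_go_nested(instance_count, wants_nested, already_nested):
    """Refuse to start nested tracking on a host that still has VMs.

    instance_count: number of VMs currently on the host (assumed input).
    wants_nested:   the driver wants to report child providers.
    already_nested: the host's provider tree already has child providers.
    """
    if wants_nested and not already_nested and instance_count > 0:
        raise NestedProviderUpgradeError(
            "%d instance(s) still on host; move them off before enabling "
            "nested resource providers" % instance_count)
    return True
```

The check only fires on the first switch to nested providers, so hosts that
are already nested, or that never go nested, are left alone.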

The upgrade operation should look like this:

1) Upgrade placement
2) Upgrade nova-scheduler
3) start loop on compute nodes. for each compute node:
  3a) disable nova-compute service on node (to take it out of scheduling)
  3b) evacuate all existing VMs off of node


You mean s/evacuate/cold migrate/ of course... :)

I meant evacuate as in `nova host-evacuate-live`, with a fallback to
`nova host-servers-migrate` if live migration isn't possible.

  3c) upgrade compute node (on restart, the compute node will see no
      VMs running on the node and will construct the provider tree inside
      update_provider_tree() with an appropriate set of child providers
      and inventories on those child providers)
  3d) enable nova-compute service on node

Which is virtually identical to the "normal" upgrade process whenever
there are significant changes to the compute node -- such as upgrading
libvirt or the kernel.
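Steps 3a through 3d above could be driven by something like the following
sketch. The `ops` object and every method on it are hypothetical stand-ins
for whatever tooling an operator already uses (for example, wrappers around
`nova host-evacuate-live` and `nova host-servers-migrate`); the exception
class is likewise made up for illustration.

```python
class LiveMigrationNotPossible(Exception):
    """Hypothetical signal that live evacuation of a host failed."""


def upgrade_compute_node(host, ops):
    """Run steps 3a-3d for one compute host using operator tooling `ops`."""
    ops.disable_service(host)              # 3a: take host out of scheduling
    try:
        ops.live_evacuate(host)            # 3b: try live migration first
    except LiveMigrationNotPossible:
        ops.cold_migrate(host)             # 3b: fall back to cold migration
    if ops.instance_count(host) != 0:
        raise RuntimeError("host %s still has instances" % host)
    ops.upgrade(host)                      # 3c: upgrade and restart compute
    ops.enable_service(host)               # 3d: put host back in scheduling


def rolling_upgrade(hosts, ops):
    """Step 3: loop over the compute nodes one at a time."""
    for host in hosts:
        upgrade_compute_node(host, ops)
```

Refusing to continue while instances remain mirrors the requirement that the
compute node sees no VMs when it first builds its provider tree.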


Not necessarily. It's totally legit (and I expect quite common) to just
reboot the host to take kernel changes, bringing back all the instances
that were there when it resumes.

So, you're saying the normal process is to try upgrading the Linux
kernel and associated low-level libs, wait the requisite amount of time
that takes (can be a long time) and just hope that everything comes back
OK? That doesn't sound like any upgrade I've ever seen. All upgrade
procedures I have seen attempt to get the workloads off of the compute
host before trying anything major (and upgrading a Linux kernel or
low-level lib like libvirt is a major thing IMHO).


The "normal" case of moving things around slide-puzzle-style applies to
live migration (which isn't an option here).

Sorry, I was saying that for all the lovely resources that have been
bolted on to Nova in the past 5 years (CPU pinning, NUMA topologies, PCI
passthrough, SR-IOV PF/VFs, vGPUs, etc), that if the workload uses
*those* resources, then live migration won't work and that the admin
would need to fall back to nova host-servers-migrate. I wasn't saying
that live migration for all workloads/instances would not be a
possibility.
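To make the split concrete, here is a rough sketch of sorting a host's
instances into "can be live-migrated" and "needs cold migration". The
feature labels and the input mapping are invented for illustration; they
are not real Nova flavor or instance attributes.

```python
# Hypothetical labels for the resources listed above.
NON_LIVE_MIGRATABLE = frozenset({
    "cpu_pinning", "numa_topology", "pci_passthrough", "sriov", "vgpu",
})


def plan_moves(instances):
    """Split instances into (live, cold) lists of names.

    instances: mapping of instance name -> iterable of feature labels.
    """
    live, cold = [], []
    for name, features in sorted(instances.items()):
        if NON_LIVE_MIGRATABLE & set(features):
            cold.append(name)   # uses a resource that blocks live migration
        else:
            live.append(name)
    return live, cold
```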

I think people that can take downtime for the instances would rather
not have to move things around for no reason if the instance has to
get shut off anyway.

Maybe. Not sure. But my line of thinking is stick to a single, already
known procedure since that is safe and well-practiced.

Code that we don't have to write means code that doesn't have new bugs
that we'll have to track down and fix.

I'm also thinking that we'd be tracking down and fixing those bugs while
trying to put out a fire that was caused by trying to auto-heal
everything at once on nova-compute startup and resulting in broken state
and an inability of the nova-compute service to start again, essentially
trapping instances on the failed host. ;)

Nested resource tracking is another such significant change and should
be dealt with in a similar way, IMHO.


This basically says that for anyone to move to rocky, they will have to
cold migrate every single instance in order to do that upgrade right?

No, sorry if I wasn't clear. They can live-migrate the instances off of
the to-be-upgraded compute host. They would only need to cold-migrate
instances that use the aforementioned non-movable resources.

I kinda think we need to either:

1. Make everything perform the pivot on compute node start (which can be
    re-used by a CLI tool for the offline case)

2. Make everything default to non-nested inventory at first, and provide
    a way to migrate a compute node and its instances one at a time (in
    place) to roll through.


I would vote for Option #2 if it comes down to it.

If we are going to go through the hassle of writing a bunch of
transformation code in order to keep operator action as low as possible,
I would prefer to consolidate all of this code into the nova-manage (or
nova-status) tool and put some sort of attribute/marker on each compute
node record to indicate whether a "heal" operation has occurred for that
compute node.


Kinda like what Matt's been playing with for the heal_allocations stuff.

At least in that case, we'd have all the transform/heal code in a single
place and we wouldn't need to have much, if any, code in the compute
manager, resource tracker or "scheduler" report client.

Someone (maybe Gibi?) on this thread had mentioned having the virt
driver (in update_provider_tree) do the whole set reserved = total thing
when first attempting to create the child providers. That would work to
prevent the scheduler from attempting to place workloads on those child
providers, but we would still need some marker on the compute node to
indicate to the nova-manage heal_nested_providers (or whatever) command
that the compute node has had its provider tree validated/healed, right?
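Roughly, the two pieces would look something like this. It is only a sketch
over plain dicts: the inventory layout, the `provider_tree_healed` key and
both function names are assumptions made up for illustration, not the
placement API or the compute node schema.

```python
def fence_child_inventory(inventory):
    """Return a copy of `inventory` with reserved == total everywhere.

    inventory: mapping of resource class -> dict with at least "total".
    With nothing left unreserved, the scheduler has no capacity to place
    workloads on the new child provider until it is healed.
    """
    return {rc: dict(inv, reserved=inv["total"])
            for rc, inv in inventory.items()}


def nodes_needing_heal(compute_nodes):
    """List hosts whose (hypothetical) heal marker is not yet set."""
    return [node["host"] for node in compute_nodes
            if not node.get("provider_tree_healed", False)]
```

The heal command would walk the hosts returned by nodes_needing_heal(), fix
up each provider tree, release the reservation and then set the marker.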

We can also document "or do the cold-migration slide puzzle thing" as an
alternative for people that feel that's more reasonable.

I just think that forcing people to take down their data plane to work
around our own data model is kinda evil and something we should be
avoiding at this level of project maturity.

The use of the word "evil" is a little, well, brutal, to describe
something I'm proposing that would just be more work for operators but
(again, IMHO) be the safest proven method for solving this problem. :)


Best,
-jay

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
