On 25 October 2013 23:46, Day, Phil <philip....@hp.com> wrote:
> Hi Folks,
>
> We're very occasionally seeing problems where a thread processing a create
> hangs (and we've seen this when talking to Cinder and Glance). Whilst those
> issues need to be hunted down in their own right, they do show up what
> seems to me to be a weakness in the processing of delete requests that I'd
> like to get some feedback on.
>
> Delete is the one operation that is allowed regardless of the instance
> state (since it's a one-way operation, and users should always be able to
> free up their quota). However, when we get a create thread hung in one of
> these states, the delete requests will also block when they hit the
> manager, as they are synchronized on the uuid. Because the user making the
> delete request doesn't see anything happen, they tend to submit more
> delete requests. The service is still up, so these go to the compute
> manager as well, and eventually all of the threads will be waiting for the
> lock, and the compute manager will stop consuming new messages.
>
> The problem isn't limited to deletes - although in most cases the change
> of state in the API means that you have to keep making different calls to
> get past the state-checking logic to do the same with an instance stuck in
> another state. Users also seem to be more impatient with deletes, as they
> are trying to free up quota for other things.
>
> So while I know that we should never get a thread into a hung state in the
> first place, I was wondering about one of the following approaches to
> address just the delete case:
>
> i) Change the delete call on the manager so it doesn't wait for the uuid
> lock. Deletes should be coded so that they work regardless of the state of
> the VM, and other actions should be able to cope with a delete being
> performed from under them. There is of course no guarantee that the delete
> itself won't block as well.

I like this.
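To make that concrete, the pattern being described is roughly the following
(a purely illustrative sketch, not Nova's actual manager code - the lock
table, decorator, and method names here are all invented):

import threading
from collections import defaultdict

_instance_locks = defaultdict(threading.Lock)

def synchronized_on_uuid(fn):
    """Serialize all operations on a given instance uuid (today's model)."""
    def wrapper(self, context, instance, *args, **kwargs):
        with _instance_locks[instance['uuid']]:
            return fn(self, context, instance, *args, **kwargs)
    return wrapper

class ComputeManager(object):
    @synchronized_on_uuid
    def build_instance(self, context, instance):
        # If this blocks forever (say, a hung call out to Cinder or Glance),
        # the uuid lock is never released and every later delete for this
        # instance queues up behind it, eating manager threads.
        self._call_out_to_cinder_and_glance(context, instance)

    # Option i): terminate does *not* take the uuid lock, so it can never
    # queue behind a hung create.  The delete path then has to tolerate
    # running concurrently with whatever is still holding the lock.
    def terminate_instance(self, context, instance):
        self._shutdown_instance(context, instance)
        self._cleanup_volumes(context, instance)

    # Stand-ins for the real work, just to keep the sketch self-contained.
    def _call_out_to_cinder_and_glance(self, context, instance):
        pass

    def _shutdown_instance(self, context, instance):
        pass

    def _cleanup_volumes(self, context, instance):
        pass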
> ii) Record in the API server that a delete has been started (maybe enough
> to use the task state being set to DELETING in the API, if we're sure this
> doesn't get cleared), and add a periodic task in the compute manager to
> check for and delete instances that have been in a "DELETING" state for
> more than some timeout. Then the API, knowing that the delete will be
> processed eventually, can just no-op any further delete requests.

There may be multiple API servers; global state in an API server seems
fraught with issues.

> iii) Add some hook into the ServiceGroup API so that the timer could
> depend on getting a free thread from the compute manager pool (i.e. run
> some no-op task) - so that if there are no free threads then the service
> becomes down. That would (eventually) stop the scheduler from sending new
> requests to it, and make deletes be processed in the API server, but it
> won't of course help with commands for other instances on the same host.

This seems a little kludgy to me.

> iv) Move away from having a general topic and thread pool for all
> requests, and start a listener on an instance-specific topic for each
> running instance on a host (leaving the general topic and pool just for
> creates and other non-instance calls like the hypervisor API). Then a
> blocked task would only affect requests for a specific instance.

That seems to suggest instance # topics? Aieee. I don't think that solves
the problem anyway, because either a) you end up with a tonne of threads,
or b) you have a multiplexing thread with the same potential issue.

You could more simply just have a dedicated thread pool for deletes, and
have no thread limit on the pool. Of course, this will fail when you OOM :).
You could do a dict with instance -> thread for deletes instead, without
creating lots of queues (a rough sketch of what I mean is at the bottom of
this mail).

> I'm tending towards ii) as a simple and pragmatic solution in the near
> term, although I like both iii) and iv) as being generally good
> enhancements - but iv) in particular feels like a pretty seismic change.

My inclination would be (i) - make deletes non-blocking and idempotent,
with lazy cleanup if resources take a while to tear down.

-Rob

--
Robert Collins <rbtcoll...@hp.com>
Distinguished Technologist
HP Converged Cloud
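The "dict with instance -> thread" idea, spelled out a little more (again a
purely illustrative sketch, not Nova code - DeleteDispatcher and everything
in it is invented for this example):

import threading

class DeleteDispatcher(object):
    """One in-flight delete worker per instance; duplicates become no-ops."""

    def __init__(self, do_delete):
        self._do_delete = do_delete   # the real (idempotent) delete routine
        self._workers = {}            # instance uuid -> Thread
        self._mutex = threading.Lock()

    def request_delete(self, context, instance):
        uuid = instance['uuid']
        with self._mutex:
            worker = self._workers.get(uuid)
            if worker is not None and worker.is_alive():
                return  # a delete for this instance is already running
            worker = threading.Thread(
                target=self._run, args=(context, instance, uuid))
            self._workers[uuid] = worker
            worker.start()

    def _run(self, context, instance, uuid):
        try:
            self._do_delete(context, instance)
        finally:
            with self._mutex:
                self._workers.pop(uuid, None)

It still leaks a thread per genuinely hung delete, but repeated delete
requests from an impatient user no longer tie up the shared worker pool.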