+1

Regards,
Alex
Joshua Harlow <harlo...@yahoo-inc.com> wrote on 26/10/2013 09:29:03 AM:

> An idea that others and I are having for a similar use case in
> cinder (or it appears to be similar).
>
> If there was a well defined state machine/s in nova with well
> defined and managed transitions between states then it seems like
> this state machine could resume on failure as well as be interrupted
> when a "dueling" or preemptable operation arrives (a delete while
> being created for example). This way not only would it be very clear
> the set of states and transitions but it would also be clear how
> preemption occurs (and under what cases).
>
> Right now in nova there is a distributed and ad-hoc state machine
> which, if it was more formalized, could inherit some of the
> described useful capabilities. It would also be much more resilient
> to these types of locking problems that you described.
>
> IMHO that's the only way these types of problems will be fully
> fixed, not by more queues or more periodic tasks, but by solidifying
> & formalizing the state machines that compose the work nova does.
>
> Sent from my really tiny device...
>
> On Oct 25, 2013, at 3:52 AM, "Day, Phil" <philip....@hp.com> wrote:
>
> > Hi Folks,
> >
> > We're very occasionally seeing problems where a thread processing
> > a create hangs (we've seen this when talking to Cinder and Glance).
> > Whilst those issues need to be hunted down in their own right, they
> > do show up what seems to me to be a weakness in the processing of
> > delete requests that I'd like to get some feedback on.
> >
> > Delete is the one operation that is allowed regardless of the
> > Instance state (since it's a one-way operation, and users should
> > always be able to free up their quota). However when we get a
> > create thread hung in one of these states, the delete requests when
> > they hit the manager will also block as they are synchronized on the
> > uuid. Because the user making the delete request doesn't see
> > anything happen they tend to submit more delete requests. The
> > Service is still up, so these go to the compute manager as well,
> > and eventually all of the threads will be waiting for the lock, and
> > the compute manager will stop consuming new messages.
> >
> > The problem isn't limited to deletes - although in most cases the
> > change of state in the API means that you have to keep making
> > different calls to get past the state checker logic to do it with an
> > instance stuck in another state. Users also seem to be more
> > impatient with deletes, as they are trying to free up quota for
> > other things.
> >
> > So while I know that we should never get a thread into a hung
> > state in the first place, I was wondering about one of the
> > following approaches to address just the delete case:
> >
> > i) Change the delete call on the manager so it doesn't wait for
> > the uuid lock. Deletes should be coded so that they work regardless
> > of the state of the VM, and other actions should be able to cope
> > with a delete being performed from under them. There is of course
> > no guarantee that the delete itself won't block as well.
> >
> > ii) Record in the API server that a delete has been started (maybe
> > enough to use the task state being set to DELETING in the API if
> > we're sure this doesn't get cleared), and add a periodic task in the
> > compute manager to check for and delete instances that are in a
> > "DELETING" state for more than some timeout. Then the API, knowing
> > that the delete will be processed eventually, can just no-op any
> > further delete requests.
> >
> > iii) Add some hook into the ServiceGroup API so that the timer
> > could depend on getting a free thread from the compute manager pool
> > (i.e. run some no-op task) - so that if there are no free threads
> > then the service becomes down. That would (eventually) stop the
> > scheduler from sending new requests to it, and make deletes be
> > processed in the API server, but won't of course help with commands
> > for other instances on the same host.
> >
> > iv) Move away from having a general topic and thread pool for all
> > requests, and start a listener on an instance specific topic for
> > each running instance on a host (leaving the general topic and pool
> > just for creates and other non-instance calls like the hypervisor
> > API). Then a blocked task would only affect requests for a specific
> > instance.
> >
> > I'm tending towards ii) as a simple and pragmatic solution in the
> > near term, although I like both iii) and iv) as being both generally
> > good enhancements - but iv) in particular feels like a pretty
> > seismic change.
> >
> > Thoughts please,
> >
> > Phil
> >
> > _______________________________________________
> > OpenStack-dev mailing list
> > OpenStack-dev@lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
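
For illustration only, option ii) from Phil's mail could be sketched
roughly as below. This is not nova code: the Instance record, the Api
class, the cast_delete and force_delete callbacks, the
reap_stuck_deletes function and the 300 second timeout are all
hypothetical names and values made up for this sketch, and a plain
dict stands in for the database. The idea is that the API records the
delete once and no-ops any repeats, while a periodic task picks up
instances that have sat in DELETING for longer than the timeout.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

DELETING = "deleting"      # task state recorded by the API (assumed name)
DELETE_TIMEOUT = 300.0     # seconds; arbitrary value for this sketch


@dataclass
class Instance:
    """Hypothetical stand-in for an instance record."""
    uuid: str
    task_state: Optional[str] = None
    delete_requested_at: Optional[float] = None
    deleted: bool = False


class Api:
    """Hypothetical API layer: records the delete and no-ops repeats."""

    def __init__(self, db: Dict[str, Instance],
                 cast_delete: Callable[[str], None]):
        self.db = db
        self.cast_delete = cast_delete  # stands in for the cast to compute

    def delete(self, uuid: str, now: Optional[float] = None) -> bool:
        inst = self.db[uuid]
        if inst.deleted or inst.task_state == DELETING:
            # A delete is already recorded: do not queue another message.
            return False
        inst.task_state = DELETING
        inst.delete_requested_at = time.time() if now is None else now
        self.cast_delete(uuid)
        return True


def reap_stuck_deletes(db: Dict[str, Instance],
                       force_delete: Callable[[str], None],
                       now: Optional[float] = None,
                       timeout: float = DELETE_TIMEOUT) -> List[str]:
    """Periodic task body: finish deletes that exceeded the timeout."""
    now = time.time() if now is None else now
    reaped = []
    for inst in db.values():
        if inst.deleted or inst.task_state != DELETING:
            continue
        if inst.delete_requested_at is None:
            continue  # no timestamp recorded, nothing to compare against
        if now - inst.delete_requested_at > timeout:
            force_delete(inst.uuid)  # assumed not to wait on the uuid lock
            inst.deleted = True
            reaped.append(inst.uuid)
    return reaped
```

With this shape a user who resubmits the delete gets a no-op from the
API instead of another message queued behind the uuid lock, so the
compute manager's thread pool is not drained by repeats. It does not
address the caveat from option i): the forced delete itself could
still block.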