On Sat, Apr 16, 2016 at 1:18 AM, Zane Bitter <zbit...@redhat.com> wrote:
> On 15/04/16 10:58, Anant Patil wrote:
>> On 14-Apr-16 23:09, Zane Bitter wrote:
>>> On 11/04/16 04:51, Anant Patil wrote:
>>>> After a lot of ping-pong in my head, I have taken a different
>>>> approach to implement stack-update-cancel when convergence is on.
>>>> Polling for traversal updates in each heat engine worker is not an
>>>> efficient method, and neither is broadcasting.
>>>>
>>>> In the new implementation, when a stack-cancel-update request is
>>>> received, the heat engine worker will immediately cancel the
>>>> eventlets running locally for the stack. It then sends cancel
>>>> messages to only those heat engines that are working on the stack,
>>>> one request per engine.
>>>
>>> I'm concerned that this is forgetting the reason we didn't implement
>>> this in convergence in the first place. The purpose of
>>> stack-cancel-update is to roll the stack back to its pre-update
>>> state, not to unwedge blocked resources.
>>
>> Yes, we thought this was never needed because we consciously decided
>> that the concurrent update feature would suffice for the user's
>> needs; exactly the reason for me to implement this so late. But there
>> were questions about API compatibility, and what if the user really
>> wants to cancel the update, given that he/she knows the consequences
>> of it?
>
> Cool, we are on the same page then :)
>
>>> The problem with just killing a thread is that the resource gets
>>> left in an unknown state. (It's slightly less dangerous if you do it
>>> only during sleeps, but still the state is indeterminate.) As a
>>> result, we mark all such resources UPDATE_FAILED, and anything
>>> (apart from nested stacks) in a FAILED state is liable to be
>>> replaced on the next update (straight away in the case of a
>>> rollback). That's why in convergence we just let resources run their
>>> course rather than cancelling them, and of course we are able to do
>>> so because they don't block other operations on the stack until they
>>> reach the point of needing to operate on that particular resource.
>>
>> The eventlet returns after each "step", so it's not that bad, but I
>> do agree that the resource might not be in a state from where it can
>> "resume", and hence the update-replace.
>
> Yeah, I saw you implemented it that way, and this is a *big*
> improvement. That will help avoid bugs like
> http://lists.openstack.org/pipermail/openstack-dev/2016-January/084467.html
>
> The issue is that Heat *always* moves the resource to FAILED and
> therefore it is *always* replaced in the future, even if it would have
> completed fine.
>
> So doing some trivial change that is guaranteed to happen in-place
> could result in your critical resource that must never be replaced
> (e.g. a Cinder volume) being replaced if you happen to cancel the
> update at just the wrong moment.

I agree with you on the need to have a mechanism to just stop doing the
update (or whatever heat was doing to that resource :))
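To show the kind of per-step stopping I have in mind, here is a rough
sketch of the pattern; all the names below are made up for illustration,
this is not the actual patch. The worker only checks for cancellation at
step boundaries, so a greenthread is never interrupted mid-operation:

import eventlet


class ResourceWorker(object):
    """Toy stand-in for a worker processing one resource's steps."""

    def __init__(self, name, steps):
        self.name = name
        self.steps = steps        # callables, one per "step"
        self.cancelled = False    # set by the cancel message handler

    def cancel(self):
        # Called when the engine receives a cancel message for the stack.
        self.cancelled = True

    def run(self):
        for step in self.steps:
            if self.cancelled:
                # Stop only at a step boundary, so the resource is left
                # in a known (if incomplete) state for the caller to mark.
                return 'CANCELLED'
            step()
            eventlet.sleep(0)     # yield so messages can be processed
        return 'COMPLETE'


if __name__ == '__main__':
    worker = ResourceWorker('my_server',
                            [lambda: eventlet.sleep(0.1) for _ in range(50)])
    gt = eventlet.spawn(worker.run)
    eventlet.sleep(0.25)          # a few steps complete...
    worker.cancel()               # ...then a stack-cancel-update arrives
    print(worker.name, gt.wait())  # -> my_server CANCELLED

In the real patch the cancel flag would be flipped by the targeted
cancel message that each engine working on the stack receives, rather
than by a broadcast.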
>> I acknowledge your concern here, but I see that the user really
>> knows that the stack is stuck because of some unexpected failure
>> which heat is not aware of, and wants to cancel it.
>
> I think there are two different use cases here: (1) just stop the
> update and don't start updating any more resources (and maybe roll
> back what has already been done); and (2) kill the update on this
> resource that is stuck. Using the same command for both is likely to
> cause trouble for people who were only wanting the first one.
>
> The other option would be to have stack-cancel-update just do (1) by
> default, but add a --cancel-me-harder option that also does (2).
>
>>> That leaves the problem of what to do when you _know_ a resource is
>>> going to fail, you _want_ to replace it, and you don't want to wait
>>> for the stack timeout. (In theory this problem will go away when
>>> Phase 2 of convergence is fully implemented, but I agree we need a
>>> solution for Phase 1.) Now that we have the mark-unhealthy API,[1]
>>> that seems to me like a better candidate for the functionality to
>>> stop threads than stack-cancel-update is, since its entire purpose
>>> in life is to set a resource into a FAILED state so that it will get
>>> replaced on the next stack update.
>>>
>>> So from a user's perspective, they would issue stack-cancel-update
>>> to start the rollback, and iff that gets stuck waiting on a resource
>>> that is doomed to fail eventually and which they just want to
>>> replace, they can issue resource-mark-unhealthy to just stop that
>>> resource.
>>
>> I was thinking of having the rollback optional while cancelling the
>> update. The user may want to cancel the update and issue a new one,
>> but not roll back.
>
> +1, this is a good idea. I originally thought that you'd never want to
> leave the stack in an intermediate state, but experience with TripleO
> (which can't really do rollbacks) is that sometimes you really do just
> want to hit the panic button and stop the world :D

Yeah, I have heard folks wanting to just cancel and nothing else.

>>> What do you think?
>>
>> I think it is a good idea, but I see that a resource can be marked
>> unhealthy only after it is done.
>
> Currently, yes. The idea would be to change that so that if it finds
> the resource IN_PROGRESS then it kills the thread and makes sure the
> resource is in a FAILED state.

Move the resource to CHECK_FAILED?

> I imagine/hope it wouldn't require big changes to your patch, mostly
> just changing where it's triggered from.

I will be more comfortable submitting another patch to implement this
feature :)

> The trick would be if the stack update is still running and the
> resource is currently IN_PROGRESS to make sure that we fail the whole
> stack update (rolling back if the user has enabled that).

IMO, we can probably use the cancel command to do this, because when
you mark a resource as unhealthy, you are cancelling any action running
on that resource. Would the following be OK?

(1) stack-cancel-update <stack_id>
    Cancel the update, mark cancelled resources as FAILED and roll back
    (the existing behaviour).

(2) stack-cancel-update <stack_id> --no-rollback
    Just cancel the update and mark cancelled resources as FAILED.

(3) stack-cancel-update <stack_id> <resource_id> ... <resource_id>
    Just stop the action on the given resources, mark them CHECK_FAILED
    and do nothing else. The stack won't progress further; other
    resources still running when cancel-update is issued will complete.

(3) is like mark-unhealthy while the stack is IN_PROGRESS. Also, IMO it
doesn't make any sense to run (3) with rollback, as the user just wants
to stop some resources. Please correct me if I am wrong.
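To make the behaviour of (1)-(3) concrete, here is roughly how I
picture the engine side handling them. This is only a sketch with
invented stand-ins (FakeResource, FakeStack and cancel_update are not
Heat code):

class FakeResource(object):
    def __init__(self, name):
        self.name = name
        self.state = ('UPDATE', 'IN_PROGRESS')

    def stop_current_action(self):
        # In Heat this would cancel the greenthread at a step boundary.
        pass

    def set_state(self, action, status):
        self.state = (action, status)


class FakeStack(object):
    def __init__(self, resources):
        self.resources = {r.name: r for r in resources}
        self.rolled_back = False

    def in_progress(self, names=None):
        names = names or self.resources
        return [r for n, r in self.resources.items()
                if n in names and r.state[1] == 'IN_PROGRESS']

    def start_rollback(self):
        self.rolled_back = True


def cancel_update(stack, resource_names=None, rollback=True):
    if resource_names:
        # (3): stop only the named resources, mark them CHECK_FAILED,
        # leave everything else running and never roll back.
        for res in stack.in_progress(resource_names):
            res.stop_current_action()
            res.set_state('CHECK', 'FAILED')
        return
    # (1)/(2): stop the whole traversal; cancelled resources become FAILED.
    for res in stack.in_progress():
        res.stop_current_action()
        res.set_state('UPDATE', 'FAILED')
    if rollback:            # (1) is the default; (2) is --no-rollback
        stack.start_rollback()


if __name__ == '__main__':
    stack = FakeStack([FakeResource('server'), FakeResource('volume')])
    cancel_update(stack, resource_names=['volume'])   # mode (3)
    print(stack.resources['volume'].state)   # ('CHECK', 'FAILED')
    print(stack.resources['server'].state)   # still IN_PROGRESS
    print(stack.rolled_back)                 # False

The point of (3) is that it never triggers a rollback and never touches
resources that were not named.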
>> The cancel update would take care of in-progress resources gone bad.
>> I really thought the mark-unhealthy and stack-cancel-update were
>> complementary features rather than contradictory ones.
>
> I'm relaxed about whether this is implemented as part of
> mark-unhealthy or as a non-default option to cancel-update. The main
> thing is not to put IN_PROGRESS resources into a FAILED state by
> default whenever the user cancels an update.
>
> Reusing mark-unhealthy as the trigger for this functionality seemed
> appealing because it already has basically the semantics that you are
> going to get (tell Heat to replace this resource on the next update),
> so there should be no surprises for users, and because it offers
> fine-grained control (at the resource level rather than the stack
> level).

I agree.

> cheers,
> Zane.