This is a design issue that we need to improve in general. However, simple rollback logic does not solve the problem, since abnormal termination can happen at any time, which means it can also happen in the middle of the job cancellation process.
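To make that distinction concrete, a recovery step that must survive crashes has to be written as an idempotent reconciliation pass that runs on (re)start, rather than as a rollback triggered at cancellation time. The sketch below is only illustrative and uses hypothetical names (VolumeStore, JobStore, VolumeRecord); it is not CloudStack's actual code:

    import java.util.List;

    // Minimal sketch of an idempotent, restart-time reconciliation pass.
    // VolumeStore, JobStore and VolumeRecord are hypothetical placeholders,
    // not CloudStack's actual classes.
    public class MigrationReconciler {

        // Hypothetical persistence interfaces, included only so the sketch is self-contained.
        public interface VolumeStore {
            List<VolumeRecord> findByState(String state);
            boolean transition(long volumeId, String expectedState, String newState);
            void markForCleanup(long volumeId);
        }

        public interface JobStore {
            void markFailed(long jobId, String reason);
        }

        public record VolumeRecord(long id, long destinationVolumeId, long pendingJobId) {}

        private final VolumeStore volumes;
        private final JobStore jobs;

        public MigrationReconciler(VolumeStore volumes, JobStore jobs) {
            this.volumes = volumes;
            this.jobs = jobs;
        }

        // Safe to run on every management server start and to re-run after a crash
        // at any point: each step only acts on records that are still stuck, so a
        // partially completed pass simply resumes on the next start.
        public void reconcile() {
            for (VolumeRecord vol : volumes.findByState("Migrating")) {
                // Fail the orphaned job so the user can re-issue the command.
                jobs.markFailed(vol.pendingJobId(), "cancelled because of management server restart");
                // Move the source volume back to a usable state; the compare-and-set
                // on the expected state makes a repeated pass harmless.
                volumes.transition(vol.id(), "Migrating", "Ready");
                // Flag the half-copied destination for asynchronous cleanup rather
                // than trying to roll it back inline.
                volumes.markForCleanup(vol.destinationVolumeId());
            }
        }
    }

Because every step only transitions records that are still stuck, running the pass twice, or crashing halfway through it, leaves the system no worse off.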
Under the current architecture, the cleanup work is handled by the VM sync process. We allow jobs to be cancelled or to fail at any time; this design decision may leave operations that were being carried out by the stopping or crashed management server in a temporarily failed state, and the VM sync process is expected to self-heal and bring the system data back to a consistent state. This design choice is still acceptable to a certain degree; unfortunately, the process is buggy in current CloudStack releases. The example Marcus gave falls into the category of a bug in re-syncing a VM in the Migrating state (the re-sync should basically fail the operation and allow the user to re-issue the command).

I've refactored the modeling used by the VM sync process, but was not able to merge it into the main branch for the 4.2 release due to community concerns about how late such architecture changes would land. I will retry the merge effort after the 4.2 release.

Kelven

On 9/3/13 10:59 AM, "Marcus Sorensen" <shadow...@gmail.com> wrote:

>I'm trying to figure out if/how management and agent restarts are
>gracefully handled for long running jobs. My initial testing shows
>that maybe they aren't. For example, if I try to migrate a storage
>volume and then restart the management server, I end up with two
>volumes (source and destination) stuck in the Migrating state, with
>the VM unable to start and the job stating:
>
>    {
>        "accountid": "505add16-12d8-11e3-8495-5254004eff4f",
>        "cmd": "org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd",
>        "created": "2013-09-03T11:41:55-0600",
>        "jobid": "698cc7cf-4ecc-40da-9bcf-261a7921ab95",
>        "jobprocstatus": 0,
>        "jobresult": {
>            "errorcode": 530,
>            "errortext": "job cancelled because of management server restart"
>        },
>        "jobresultcode": 530,
>        "jobresulttype": "object",
>        "jobstatus": 2,
>        "userid": "505bd5d6-12d8-11e3-8495-5254004eff4f"
>    }
>
>If all jobs react this way, it doesn't seem like a small bug, but
>perhaps a design issue. If a job is cancelled, its state should be
>rolled back, I think. Perhaps every job should have a cleanup method
>that is called when the job is considered cancelled (assuming the
>cancellation occurs prior to shutdown, but that doesn't handle
>crashes).
>
>The end result is that everyone using CloudStack should be terrified
>of restarting their management server, I think, especially as their
>environment grows and has many things going on. Anything that goes
>through a state machine could get stuck.
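A per-job cleanup hook along the lines Marcus suggests might look roughly like the sketch below; the names are hypothetical and this is not CloudStack's actual async-job framework. As noted above, such a hook only covers graceful cancellation, so a restart-time reconciliation is still needed to cope with crashes.

    // Hypothetical shape of the per-job cleanup hook suggested above;
    // not CloudStack's actual async-job interfaces.
    public interface CancellableJob {

        // Normal forward execution of the job.
        void execute();

        // Called when the job is cancelled before the management server shuts
        // down, e.g. to move volumes out of the Migrating state. This hook
        // never runs if the server crashes, which is why a restart-time
        // reconciliation pass is still required.
        void cleanupOnCancel();
    }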