This is a design issue that we need to improve in general. However, simple rollback logic does not solve the problem, since abnormal termination can happen at any time, which means it can also happen in the middle of the job cancellation process.
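To make that distinction concrete, a recovery step that must survive crashes has to be written as an idempotent reconciliation pass that runs on (re)start, rather than as a rollback triggered at cancellation time. The sketch below is only illustrative and uses hypothetical names (VolumeStore, JobStore, VolumeRecord); it is not CloudStack's actual code:

    import java.util.List;

    // Minimal sketch of an idempotent, restart-time reconciliation pass.
    // VolumeStore, JobStore and VolumeRecord are hypothetical placeholders,
    // not CloudStack's actual classes.
    public class MigrationReconciler {

        // Hypothetical persistence interfaces, included only so the sketch is self-contained.
        public interface VolumeStore {
            List<VolumeRecord> findByState(String state);
            boolean transition(long volumeId, String expectedState, String newState);
            void markForCleanup(long volumeId);
        }

        public interface JobStore {
            void markFailed(long jobId, String reason);
        }

        public record VolumeRecord(long id, long destinationVolumeId, long pendingJobId) {}

        private final VolumeStore volumes;
        private final JobStore jobs;

        public MigrationReconciler(VolumeStore volumes, JobStore jobs) {
            this.volumes = volumes;
            this.jobs = jobs;
        }

        // Safe to run on every management server start and to re-run after a crash
        // at any point: each step only acts on records that are still stuck, so a
        // partially completed pass simply resumes on the next start.
        public void reconcile() {
            for (VolumeRecord vol : volumes.findByState("Migrating")) {
                // Fail the orphaned job so the user can re-issue the command.
                jobs.markFailed(vol.pendingJobId(), "cancelled because of management server restart");
                // Move the source volume back to a usable state; the compare-and-set
                // on the expected state makes a repeated pass harmless.
                volumes.transition(vol.id(), "Migrating", "Ready");
                // Flag the half-copied destination for asynchronous cleanup rather
                // than trying to roll it back inline.
                volumes.markForCleanup(vol.destinationVolumeId());
            }
        }
    }

Because every step only transitions records that are still stuck, running the pass twice, or crashing halfway through it, leaves the system no worse off.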
Under the current architecture, the cleanup work is handled by the VM sync process. We allow jobs to be cancelled or to fail at any time; this design decision may leave operations that were being carried out by the stopping or crashed management server in a temporarily failed state, and the VM sync process is expected to self-heal and bring the system data back to a consistent state. This design choice is still acceptable to a certain degree; unfortunately, the process is buggy in current CloudStack releases. The example Marcus gave falls into the category of a bug in re-syncing a VM in the Migrating state (the re-sync should basically fail the operation and allow the user to re-issue the command).

I've refactored the modeling used by the VM sync process, but was not able to merge it into the main branch for the 4.2 release due to community concerns about how late such architecture changes would land. I will retry the merge effort after the 4.2 release.

Kelven

On 9/3/13 10:59 AM, "Marcus Sorensen" <shadow...@gmail.com> wrote:

>I'm trying to figure out if/how management and agent restarts are
>gracefully handled for long running jobs. My initial testing shows
>that maybe they aren't. For example, if I try to migrate a storage
>volume and then restart the management server, I end up with two
>volumes (source and destination) stuck in the Migrating state, with
>the VM unable to start and the job stating:
>
>    {
>        "accountid": "505add16-12d8-11e3-8495-5254004eff4f",
>        "cmd": "org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd",
>        "created": "2013-09-03T11:41:55-0600",
>        "jobid": "698cc7cf-4ecc-40da-9bcf-261a7921ab95",
>        "jobprocstatus": 0,
>        "jobresult": {
>            "errorcode": 530,
>            "errortext": "job cancelled because of management server restart"
>        },
>        "jobresultcode": 530,
>        "jobresulttype": "object",
>        "jobstatus": 2,
>        "userid": "505bd5d6-12d8-11e3-8495-5254004eff4f"
>    }
>
>If all jobs react this way, it doesn't seem like a small bug, but
>perhaps a design issue. If a job is cancelled, its state should be
>rolled back, I think. Perhaps every job should have a cleanup method
>that is called when the job is considered cancelled (assuming the
>cancellation occurs prior to shutdown, but that doesn't handle
>crashes).
>
>The end result is that everyone using CloudStack should be terrified
>of restarting their management server, I think, especially as their
>environment grows and has many things going on. Anything that goes
>through a state machine could get stuck.
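A per-job cleanup hook along the lines Marcus suggests might look roughly like the sketch below; the names are hypothetical and this is not CloudStack's actual async-job framework. As noted above, such a hook only covers graceful cancellation, so a restart-time reconciliation is still needed to cope with crashes.

    // Hypothetical shape of the per-job cleanup hook suggested above;
    // not CloudStack's actual async-job interfaces.
    public interface CancellableJob {

        // Normal forward execution of the job.
        void execute();

        // Called when the job is cancelled before the management server shuts
        // down, e.g. to move volumes out of the Migrating state. This hook
        // never runs if the server crashes, which is why a restart-time
        // reconciliation pass is still required.
        void cleanupOnCancel();
    }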