Re: Making sense of Aurora terminal states

Bill Farner Thu, 19 Feb 2015 14:36:36 -0800

On Thu, Feb 19, 2015 at 1:27 PM, Hussein Elgridly <huss...@broadinstitute.org> wrote:


> I've just spent the afternoon making a flowchart out of
> TaskStateMachine.java in an attempt to figure out what Aurora states
> actually mean. Given that all the jobs I submit have unique names and I
> don't permit retries, I would like to put together a set of rules that
> determine whether a job is _really_ terminal and definitely won't be
> rescheduled.
>
> Would one of the Aurora devs be willing to play a game of True or False
> with the following statements?
>
> 1. If all my job names are unique and I do an aurora job status
> --write-json, there will be at most one element in the "active" list.
>

True iff the job has only one instance.


> 2. Jobs in the "inactive" list are ordered by last update time, most recent
> first.
>

False.  They are sorted by instance ID [1], which doesn't make much sense.

[1]
https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/src/main/python/apache/aurora/client/cli/jobs.py#L635-L636


> 3. A job's "status" will always equal the status of the last item in its
> "taskEvents" list.
>

True.
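
A minimal sketch of how a client might check statement 3 against parsed
status output. The shape of the task dict is an assumption: only the
"status" and "taskEvents" keys are named in this thread, and the idea that
each event carries its own "status" field is a guess at the JSON layout.
The helper name is hypothetical.

```python
def status_matches_last_event(task):
    """Return True if the task's status equals its last event's status.

    `task` is assumed to be a dict parsed from the JSON status output,
    with a "status" key and a "taskEvents" list whose entries each have
    a "status" key (assumed layout, not confirmed by the thread).
    """
    events = task.get("taskEvents", [])
    if not events:
        # No history to compare against, so the check cannot pass.
        return False
    return task.get("status") == events[-1].get("status")
```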


> 4. The full list of terminal states is [LOST, FINISHED, FAILED, KILLED]. A
> job that is not in one of these states will undergo more transitions and
> will remain in the "active" list until it gets to one of these states.
> (Will I ever see DELETED, or do they not show up in aurora job status?)
>

True.  Source of truth is [1].  We actually don't have a state [2] for
DELETED.

[1]
https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L410-L413
[2]
https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L348-L380
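
For illustration, the terminal-state rule from statement 4 could be encoded
on the client side roughly as follows. This is only a sketch: the state
names come from this thread, while the function name and the use of plain
strings for states are assumptions. If more states are added later, the set
would need updating.

```python
# Terminal states as listed in statement 4 of this thread.
TERMINAL_STATES = frozenset({"LOST", "FINISHED", "FAILED", "KILLED"})


def is_terminal(status):
    """Return True if `status` (a state name string) is terminal."""
    return status in TERMINAL_STATES
```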


> 5. A job in the LOST state will always be rescheduled unless it went
> through KILLING first. (What does this represent - killed by user and then
> lost connectivity to the slave?)
>

True.  That is one way it could happen.  It could also happen if the
scheduler times the task out while waiting to hear back from Mesos after
attempting to kill the task.


> 6. A job will be rescheduled if it goes through one of [RESTARTING,
> DRAINING, PREEMPTING].
>

True.


> 7. Assuming maxTaskFailures = 1, #5 and #6 are the ONLY situations in which
> a job will be rescheduled.
>

True.
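
Statements 5 through 7 can be combined into a small client-side check. The
sketch below assumes maxTaskFailures = 1 and takes the task's state history
as an ordered list of state names, oldest first. The function names and
the list-of-strings input are hypothetical and are not part of Aurora.

```python
# States named in statement 6: passing through any of them implies a reschedule.
RESCHEDULING_STATES = frozenset({"RESTARTING", "DRAINING", "PREEMPTING"})
# Terminal states as listed in statement 4.
TERMINAL_STATES = frozenset({"LOST", "FINISHED", "FAILED", "KILLED"})


def expect_reschedule(history):
    """Return True if statements 5 or 6 say the task will be rescheduled.

    `history` is an ordered list of state names, oldest first.
    Assumes maxTaskFailures = 1, as in statement 7.
    """
    if not history:
        return False
    seen = set(history)
    # Statement 6: went through RESTARTING, DRAINING or PREEMPTING.
    if seen & RESCHEDULING_STATES:
        return True
    # Statement 5: ended in LOST without passing through KILLING.
    return history[-1] == "LOST" and "KILLING" not in seen


def really_done(history):
    """Return True if the task is terminal and no reschedule is expected."""
    return (
        bool(history)
        and history[-1] in TERMINAL_STATES
        and not expect_reschedule(history)
    )
```

For example, a history ending RUNNING, LOST would be treated as "will be
rescheduled", while RUNNING, KILLING, LOST would be treated as done.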


> 8. These rules are unlikely to change in the future ;)
>

True, though we could add more states, which would invalidate (4) and (6).
In practice, we have changed the states and their meanings very little in
~5 years.


> Finally, I noticed something odd: ASSIGNED -> LOST has followups [KILL,
> RESCHEDULE], but STARTING and RUNNING -> LOST only has [RESCHEDULE] as a
> followup. Why?
>

This is because ASSIGNED -> LOST may mean that there was a race between
creating the task and Aurora timing out the launch (it may not have heard
back from Mesos).  To reduce the likelihood of a redundant instance, we
proactively try to kill the task.  The RUNNING state does not time out, so
we do not have the same concern there.


> Thanks,
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
