I've just spent the afternoon making a flowchart out of
TaskStateMachine.java in an attempt to figure out what Aurora states
actually mean. Given that all the jobs I submit have unique names and I
don't permit retries, I would like to put together a set of rules that
determine whether a job is _really_ terminal and definitely won't be
rescheduled.

Would one of the Aurora devs be willing to play a game of True or False
with the following statements?

1. If all my job names are unique and I run aurora job status
--write-json, there will be at most one element in the "active" list.

2. Jobs in the "inactive" list are ordered by last update time, most recent
first.

3. A job's "status" will always equal the status of the last item in its
"taskEvents" list.

4. The full list of terminal states is [LOST, FINISHED, FAILED, KILLED]. A
job that is not in one of these states will undergo more transitions and
will remain in the "active" list until it gets to one of these states.
(Will I ever see DELETED, or do they not show up in aurora job status?)

5. A job in the LOST state will always be rescheduled unless it went
through KILLING first. (What does KILLING -> LOST represent: killed by the
user, and then connectivity to the slave was lost?)

6. A job will be rescheduled if it goes through one of [RESTARTING,
DRAINING, PREEMPTING].

7. Assuming maxTaskFailures = 1, #5 and #6 are the ONLY situations in which
a job will be rescheduled.

8. These rules are unlikely to change in the future ;)
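
To make the question concrete, here is a rough Python sketch of the check I
would build if statements 4-7 turn out to be true. It encodes those
statements as hypotheses, not confirmed Aurora behaviour. The function name
is_really_terminal is hypothetical, and I am assuming each item in
"taskEvents" carries its own "status" field.

```python
# Sketch only: encodes statements 4-7 as unconfirmed hypotheses.
TERMINAL = {"LOST", "FINISHED", "FAILED", "KILLED"}           # statement 4
RESCHEDULING = {"RESTARTING", "DRAINING", "PREEMPTING"}       # statement 6


def is_really_terminal(task):
    """Return True if, under statements 4-7, the task will not be rescheduled.

    `task` is assumed to be a dict with a "status" string and a "taskEvents"
    list whose items each have a "status" string (this layout is an assumption).
    """
    status = task["status"]
    if status not in TERMINAL:
        # Statement 4: non-terminal states will undergo more transitions.
        return False

    # Every state the task has passed through, according to its event history.
    seen = {event["status"] for event in task.get("taskEvents", [])}

    if seen & RESCHEDULING:
        # Statement 6: passing through any of these implies a reschedule.
        return False
    if status == "LOST" and "KILLING" not in seen:
        # Statement 5: LOST is rescheduled unless it went through KILLING.
        return False
    return True
```

For example, a task whose history is RUNNING then LOST would come back
False, while KILLING then LOST would come back True. If any of statements
4-7 is false, this check is wrong in the corresponding branch.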

Finally, I noticed something odd: ASSIGNED -> LOST has followups [KILL,
RESCHEDULE], but STARTING -> LOST and RUNNING -> LOST only have
[RESCHEDULE] as a followup. Why?

Thanks,
Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard
