I've just spent the afternoon making a flowchart out of TaskStateMachine.java in an attempt to figure out what Aurora states actually mean. Given that all the jobs I submit have unique names and I don't permit retries, I would like to put together a set of rules that determine whether a job is _really_ terminal and definitely won't be rescheduled.
Would one of the Aurora devs be willing to play a game of True or False with the following statements?

1. If all my job names are unique and I do an aurora job status --write-json, there will be at most one element in the "active" list.

2. Jobs in the "inactive" list are ordered by last update time, most recent first.

3. A job's "status" will always equal the status of the last item in its "taskEvents" list.

4. The full list of terminal states is [LOST, FINISHED, FAILED, KILLED]. A job that is not in one of these states will undergo more transitions and will remain in the "active" list until it gets to one of these states. (Will I ever see DELETED, or do they not show up in aurora job status?)

5. A job in the LOST state will always be rescheduled unless it went through KILLING first. (What does this represent - killed by user and then lost connectivity to the slave?)

6. A job will be rescheduled if it goes through one of [RESTARTING, DRAINING, PREEMPTING].

7. Assuming maxTaskFailures = 1, #5 and #6 are the ONLY situations in which a job will be rescheduled.

8. These rules are unlikely to change in the future ;)

Finally, I noticed something odd: ASSIGNED -> LOST has followups [KILL, RESCHEDULE], but STARTING and RUNNING -> LOST only have [RESCHEDULE] as a followup. Why?

Thanks,
Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard
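
P.S. In case it helps to see what I'm after, here is a minimal Python sketch of the check I would like to write if statements 3 through 7 all turn out to be true. It is only my guess at the rules, not confirmed scheduler behaviour. The function names are made up, and the input shape (a dict with a "taskEvents" list, oldest event first, each entry carrying a "status" string) is my assumption about the --write-json output.

```python
# Hypothetical sketch: encodes statements 3-7 as if they were all true.
# None of this has been confirmed against the scheduler.

# Statement 4: assumed complete list of terminal states.
TERMINAL_STATES = {"LOST", "FINISHED", "FAILED", "KILLED"}

# Statement 6: assumed states that always lead to a reschedule.
RESCHEDULING_STATES = {"RESTARTING", "DRAINING", "PREEMPTING"}


def event_statuses(task):
    """Return the task's status history, assumed to be ordered oldest first."""
    return [event["status"] for event in task.get("taskEvents", [])]


def will_be_rescheduled(task):
    """Guess whether the task will be rescheduled (statements 5-7)."""
    history = event_statuses(task)
    if not history:
        return False
    # Statement 6: passing through any of these states implies a reschedule.
    if any(status in RESCHEDULING_STATES for status in history):
        return True
    # Statement 5: LOST is rescheduled unless the task went through KILLING.
    return history[-1] == "LOST" and "KILLING" not in history


def is_really_terminal(task):
    """True if the last event is terminal and no reschedule is expected."""
    history = event_statuses(task)
    if not history:
        return False
    # Statement 3: the last event is assumed to be the current status.
    return history[-1] in TERMINAL_STATES and not will_be_rescheduled(task)
```

Under these assumptions, a task whose events end in FINISHED would count as done, while one that ends in LOST without a preceding KILLING would not.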