This is fantastic (and I'm glad that my understanding was mostly correct) - thanks a lot.
Might I suggest folding this information into the user guide? Maybe it's only relevant for my use case, but I feel like "tasks in terminal states might be cloned and rescheduled; here's when that might happen" isn't made as explicit as it could be. I know I'd have had an easier time if there had been an explanation of "here's what each state means and what might happen next", and I can imagine [weasel words; citation needed] that other users might also find this useful.

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard

On 19 February 2015 at 17:35, Bill Farner <wfar...@apache.org> wrote:

> On Thu, Feb 19, 2015 at 1:27 PM, Hussein Elgridly <huss...@broadinstitute.org> wrote:
>
> > I've just spent the afternoon making a flowchart out of
> > TaskStateMachine.java in an attempt to figure out what Aurora states
> > actually mean. Given that all the jobs I submit have unique names and I
> > don't permit retries, I would like to put together a set of rules that
> > determine whether a job is _really_ terminal and definitely won't be
> > rescheduled.
> >
> > Would one of the Aurora devs be willing to play a game of True or False
> > with the following statements?
> >
> > 1. If all my job names are unique and I do an aurora job status
> > --write-json, there will be at most one element in the "active" list.
>
> True iff the job has only one instance.
>
> > 2. Jobs in the "inactive" list are ordered by last update time, most
> > recent first.
>
> False. They are sorted by instance ID [1], which doesn't make much sense.
>
> [1] https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/src/main/python/apache/aurora/client/cli/jobs.py#L635-L636
>
> > 3. A job's "status" will always equal the status of the last item in its
> > "taskEvents" list.
>
> True.
>
> > 4. The full list of terminal states is [LOST, FINISHED, FAILED, KILLED].
> > A job that is not in one of these states will undergo more transitions
> > and will remain in the "active" list until it gets to one of these
> > states. (Will I ever see DELETED, or do they not show up in aurora job
> > status?)
>
> True. Source of truth is [1]. We actually don't have a state [2] for
> DELETED.
>
> [1] https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L410-L413
> [2] https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L348-L380
>
> > 5. A job in the LOST state will always be rescheduled unless it went
> > through KILLING first. (What does this represent - killed by user and
> > then lost connectivity to the slave?)
>
> True. That is one way it could happen; it could also happen if the
> scheduler times the task out while waiting to hear back from mesos after
> attempting to kill the task.
>
> > 6. A job will be rescheduled if it goes through one of [RESTARTING,
> > DRAINING, PREEMPTING].
>
> True.
>
> > 7. Assuming maxTaskFailures = 1, #5 and #6 are the ONLY situations in
> > which a job will be rescheduled.
>
> True.
>
> > 8. These rules are unlikely to change in the future ;)
>
> True, though we could add more states, which would invalidate (4) and (6).
> In practice, we have changed the states and their meanings very little in
> ~5 years.
>
> > Finally, I noticed something odd: ASSIGNED -> LOST has followups [KILL,
> > RESCHEDULE], but STARTING and RUNNING -> LOST only has [RESCHEDULE] as a
> > followup. Why?
>
> This is because ASSIGNED -> LOST may mean that there was a race between
> creating the task and Aurora timing out the launch (it may not have heard
> back from mesos). To reduce the likelihood of a redundant instance, we try
> to proactively kill the race.
> The RUNNING state does not time out, so we do not have the same concern
> there.
>
> > Thanks,
> > Hussein Elgridly
> > Senior Software Engineer, DSDE
> > The Broad Institute of MIT and Harvard
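
For anyone who wants to apply the answers above mechanically, here is a rough Python sketch of rules 3 to 7. It is only an illustration under stated assumptions: it presumes each entry in "taskEvents" is a dict with a "status" key holding the state name as a string, which is a guess at the JSON layout and not something confirmed in this thread. The function names are hypothetical, and the logic only holds for maxTaskFailures = 1.

```python
# Terminal states and rescheduling triggers as confirmed in the thread
# (rules 4 and 6). Adding new states upstream would invalidate these sets.
TERMINAL_STATES = {"LOST", "FINISHED", "FAILED", "KILLED"}
RESCHEDULING_STATES = {"RESTARTING", "DRAINING", "PREEMPTING"}


def event_statuses(task_events):
    """Return the state names in order.

    Assumption: each event is a dict with a "status" string; the real
    JSON emitted by the client may differ.
    """
    return [event["status"] for event in task_events]


def is_terminal(task_events):
    """Rules 3 and 4: the current status is the last event's status."""
    statuses = event_statuses(task_events)
    return bool(statuses) and statuses[-1] in TERMINAL_STATES


def will_be_rescheduled(task_events):
    """Rules 5 to 7, assuming maxTaskFailures = 1."""
    statuses = event_statuses(task_events)
    if not statuses:
        return False
    # Rule 6: passing through any of these states implies a reschedule.
    if any(status in RESCHEDULING_STATES for status in statuses):
        return True
    # Rule 5: LOST is rescheduled unless the task went through KILLING.
    return statuses[-1] == "LOST" and "KILLING" not in statuses


def is_really_terminal(task_events):
    """True if the task is terminal and no replacement is expected."""
    return is_terminal(task_events) and not will_be_rescheduled(task_events)
```

Under those assumptions, a task whose events end in RUNNING, LOST would be reported as not really terminal, while RUNNING, KILLING, LOST would be.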