>> 5. A job in the LOST state will always be rescheduled unless it went
>> through KILLING first. (What does this represent - killed by user and then
>> lost connectivity to the slave?)
>>
> True. That is one way it could happen, it could also happen if the
> scheduler times the task out while waiting to hear back from mesos after
> attempting to kill the task.

Just to be absolutely clear on this: KILLING -> LOST will _never_ result in a
reschedule? What happens if Mesos fails to kill the task and finishes running
it - will it pass a success message back to Aurora that then gets thrown away?

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard

On 20 February 2015 at 11:08, Hussein Elgridly <huss...@broadinstitute.org> wrote:

> This is fantastic (and I'm glad that my understanding was mostly correct)
> - thanks a lot.
>
> Might I suggest folding this information into the user guide? Maybe it's
> only relevant for my use case, but I feel like "tasks in terminal states
> might be cloned and rescheduled; here's when that might happen" isn't
> made as explicit as it could be. I know I'd have had an easier time if
> there had been an explanation of "here's what each state means and what
> might happen next", and I can imagine [weasel words; citation needed] that
> other users might also find this useful.
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
> On 19 February 2015 at 17:35, Bill Farner <wfar...@apache.org> wrote:
>
>> On Thu, Feb 19, 2015 at 1:27 PM, Hussein Elgridly <
>> huss...@broadinstitute.org> wrote:
>>
>> > I've just spent the afternoon making a flowchart out of
>> > TaskStateMachine.java in an attempt to figure out what Aurora states
>> > actually mean. Given that all the jobs I submit have unique names and I
>> > don't permit retries, I would like to put together a set of rules that
>> > determine whether a job is _really_ terminal and definitely won't be
>> > rescheduled.
>> >
>> > Would one of the Aurora devs be willing to play a game of True or False
>> > with the following statements?
>> >
>> > 1.
>> > If all my job names are unique and I do an aurora job status
>> > --write-json, there will be at most one element in the "active" list.
>> >
>>
>> True iff the job has only one instance.
>>
>>
>> > 2. Jobs in the "inactive" list are ordered by last update time, most
>> > recent first.
>> >
>>
>> False. They are sorted by instance ID [1], which doesn't make much sense.
>>
>> [1]
>> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/src/main/python/apache/aurora/client/cli/jobs.py#L635-L636
>>
>>
>> > 3. A job's "status" will always equal the status of the last item in its
>> > "taskEvents" list.
>> >
>>
>> True.
>>
>>
>> > 4. The full list of terminal states is [LOST, FINISHED, FAILED, KILLED].
>> > A job that is not in one of these states will undergo more transitions
>> > and will remain in the "active" list until it gets to one of these
>> > states. (Will I ever see DELETED, or do they not show up in aurora job
>> > status?)
>> >
>>
>> True. Source of truth is [1]. We actually don't have a state [2] for
>> DELETED.
>>
>> [1]
>> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L410-L413
>> [2]
>> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L348-L380
>>
>>
>> > 5. A job in the LOST state will always be rescheduled unless it went
>> > through KILLING first. (What does this represent - killed by user and
>> > then lost connectivity to the slave?)
>> >
>>
>> True. That is one way it could happen, it could also happen if the
>> scheduler times the task out while waiting to hear back from mesos after
>> attempting to kill the task.
>>
>>
>> > 6. A job will be rescheduled if it goes through one of [RESTARTING,
>> > DRAINING, PREEMPTING].
>> >
>>
>> True.
>>
>>
>> > 7.
>> > Assuming maxTaskFailures = 1, #5 and #6 are the ONLY situations in which
>> > a job will be rescheduled.
>> >
>>
>> True.
>>
>>
>> > 8. These rules are unlikely to change in the future ;)
>> >
>>
>> True, though we could add more states, which would invalidate (4) and (6).
>> In practice, we have changed the states and their meanings very little in
>> ~5 years.
>>
>>
>> > Finally, I noticed something odd: ASSIGNED -> LOST has followups [KILL,
>> > RESCHEDULE], but STARTING and RUNNING -> LOST only has [RESCHEDULE] as a
>> > followup. Why?
>> >
>>
>> This is because ASSIGNED -> LOST may mean that there was a race between
>> creating the task and Aurora timing out the launch (it may not have heard
>> back from mesos). To reduce the likelihood of a redundant instance, we try
>> to proactively kill the race. The RUNNING state does not time out, so we
>> do not have the same concern there.
>>
>>
>> > Thanks,
>> > Hussein Elgridly
>> > Senior Software Engineer, DSDE
>> > The Broad Institute of MIT and Harvard
>> >
>> >
>
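
For reference, the rules confirmed in this thread (4 through 7) could be written down roughly as follows. This is only a sketch under stated assumptions: it takes a plain list of status names in "taskEvents" order (how that list is pulled out of the --write-json output is not shown), it assumes maxTaskFailures = 1, and the function names are made up for illustration; none of this is Aurora client code. It also takes the "True" given for rule 5 at face value, so KILLING -> LOST is treated as no reschedule, and it applies rule 6 literally to any history that contains one of the three states.

```python
# Terminal states as confirmed for rule 4 in this thread.
TERMINAL_STATES = {"LOST", "FINISHED", "FAILED", "KILLED"}

# States that lead to a reschedule according to rule 6.
RESCHEDULING_STATES = {"RESTARTING", "DRAINING", "PREEMPTING"}


def is_terminal(statuses):
    """True if the last recorded status is one of the terminal states."""
    return bool(statuses) and statuses[-1] in TERMINAL_STATES


def will_be_rescheduled(statuses):
    """Apply rules 5-7 to a terminal task (hypothetical helper).

    Assumes maxTaskFailures = 1, so FAILED by itself does not trigger a retry.
    """
    if not is_terminal(statuses):
        return False  # still active; the rules only describe terminal tasks
    if any(s in RESCHEDULING_STATES for s in statuses):
        return True  # rule 6
    if statuses[-1] == "LOST":
        return "KILLING" not in statuses  # rule 5
    return False  # rule 7: no other situation leads to a reschedule


def is_really_done(statuses):
    """True if the task is terminal and no replacement is expected."""
    return is_terminal(statuses) and not will_be_rescheduled(statuses)


if __name__ == "__main__":
    history = ["PENDING", "ASSIGNED", "STARTING", "RUNNING", "KILLING", "LOST"]
    print(is_really_done(history))
```

As noted for rule 8, adding new states would invalidate (4) and (6), so the two sets at the top are the part that would need updating.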