Re: Making sense of Aurora terminal states

Hussein Elgridly Fri, 20 Feb 2015 11:19:42 -0800

>> 5. A job in the LOST state will always be rescheduled unless it went
>> through KILLING first. (What does this represent - killed by user and
>> then lost connectivity to the slave?)
>>


> True.  That is one way it could happen; it could also happen if the
> scheduler times the task out while waiting to hear back from Mesos after
> attempting to kill the task.

Just to be absolutely clear on this: KILLING -> LOST will _never_ result in
a reschedule? What happens if Mesos fails to kill the task and finishes
running it - will it pass a success message back to Aurora that then gets
thrown away?
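
A minimal sketch of statements 4 through 7 from the quoted thread below, assuming maxTaskFailures = 1 and assuming each task's history is available as an ordered list of status names taken from "taskEvents". The classify function and its three labels are hypothetical names for illustration, not part of Aurora. It treats KILLING -> LOST as final, which is the reading of statement 5 being asked about above, so that branch is only as reliable as the answer to that question.

```python
# Terminal states as listed in statement 4 of the quoted thread.
TERMINAL_STATES = {"LOST", "FINISHED", "FAILED", "KILLED"}
# Transitional states that lead to a reschedule (statement 6).
RESCHEDULING_STATES = {"RESTARTING", "DRAINING", "PREEMPTING"}


def classify(statuses):
    """Classify a task from its ordered status history.

    Returns "active", "rescheduled", or "final" (hypothetical labels).
    Assumes maxTaskFailures = 1, as in statement 7.
    """
    if not statuses or statuses[-1] not in TERMINAL_STATES:
        return "active"  # more transitions are still expected
    if RESCHEDULING_STATES.intersection(statuses):
        return "rescheduled"  # statement 6
    if statuses[-1] == "LOST" and "KILLING" not in statuses:
        return "rescheduled"  # statement 5
    return "final"  # statement 7: no other reschedule path


print(classify(["PENDING", "ASSIGNED", "RUNNING", "LOST"]))
print(classify(["PENDING", "ASSIGNED", "RUNNING", "KILLING", "LOST"]))
```

If more states are added later, as mentioned in the reply to statement 8, both sets would need updating.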

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 20 February 2015 at 11:08, Hussein Elgridly <huss...@broadinstitute.org>
wrote:

> This is fantastic (and I'm glad that my understanding was mostly correct)
> - thanks a lot.
>
> Might I suggest folding this information into the user guide? Maybe it's
> only relevant for my use case, but I feel like "tasks in terminal states
> might be cloned and rescheduled; here's when that might happen" isn't
> made as explicit as it could be. I know I'd have had an easier time if
> there had been an explanation of "here's what each state means and what
> might happen next", and I can imagine [weasel words; citation needed] that
> other users might also find this useful.
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
> On 19 February 2015 at 17:35, Bill Farner <wfar...@apache.org> wrote:
>
>> On Thu, Feb 19, 2015 at 1:27 PM, Hussein Elgridly <
>> huss...@broadinstitute.org> wrote:
>>
>> > I've just spent the afternoon making a flowchart out of
>> > TaskStateMachine.java in an attempt to figure out what Aurora states
>> > actually mean. Given that all the jobs I submit have unique names and I
>> > don't permit retries, I would like to put together a set of rules that
>> > determine whether a job is _really_ terminal and definitely won't be
>> > rescheduled.
>> >
>> > Would one of the Aurora devs be willing to play a game of True or False
>> > with the following statements?
>> >
>> > 1. If all my job names are unique and I do an aurora job status
>> > --write-json, there will be at most one element in the "active" list.
>> >
>>
>> True iff the job has only one instance.
>>
>>
>> > 2. Jobs in the "inactive" list are ordered by last update time, most
>> > recent first.
>> >
>>
>> False.  They are sorted by instance ID [1], which doesn't make much sense.
>>
>> [1]
>>
>> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/src/main/python/apache/aurora/client/cli/jobs.py#L635-L636
>>
>>
>> > 3. A job's "status" will always equal the status of the last item in its
>> > "taskEvents" list.
>> >
>>
>> True.
>>
>>
>> > 4. The full list of terminal states is [LOST, FINISHED, FAILED,
>> > KILLED]. A job that is not in one of these states will undergo more
>> > transitions and will remain in the "active" list until it gets to one
>> > of these states.
>> > (Will I ever see DELETED, or do they not show up in aurora job status?)
>> >
>>
>> True.  Source of truth is [1].  We actually don't have a state [2] for
>> DELETED.
>>
>> [1]
>>
>> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L410-L413
>> [2]
>>
>> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L348-L380
>>
>>
>> > 5. A job in the LOST state will always be rescheduled unless it went
>> > through KILLING first. (What does this represent - killed by user and
>> > then lost connectivity to the slave?)
>> >
>>
>> True.  That is one way it could happen; it could also happen if the
>> scheduler times the task out while waiting to hear back from Mesos after
>> attempting to kill the task.
>>
>>
>> > 6. A job will be rescheduled if it goes through one of [RESTARTING,
>> > DRAINING, PREEMPTING].
>> >
>>
>> True.
>>
>>
>> > 7. Assuming maxTaskFailures = 1, #5 and #6 are the ONLY situations in
>> > which a job will be rescheduled.
>> >
>>
>> True.
>>
>>
>> > 8. These rules are unlikely to change in the future ;)
>> >
>>
>> True, though we could add more states, which would invalidate (4) and (6).
>> In practice, we have changed the states and their meanings very little in
>> ~5 years.
>>
>>
>> > Finally, I noticed something odd: ASSIGNED -> LOST has followups [KILL,
>> > RESCHEDULE], but STARTING and RUNNING -> LOST only has [RESCHEDULE] as a
>> > followup. Why?
>> >
>>
>> This is because ASSIGNED -> LOST may mean that there was a race between
>> creating the task and Aurora timing out the launch (it may not have heard
>> back from Mesos).  To reduce the likelihood of a redundant instance, we
>> try to proactively kill the task.  The RUNNING state does not time out,
>> so we do not have the same concern there.
>>
>>
>> > Thanks,
>> > Hussein Elgridly
>> > Senior Software Engineer, DSDE
>> > The Broad Institute of MIT and Harvard
>> >
>>
>
>
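
On statements 2 and 3 in the thread above: since the "inactive" list comes back sorted by instance ID, a client can re-sort it itself. The following is a rough sketch only. It assumes each entry in the JSON output is a dict with a "status" key and a "taskEvents" list whose items carry "status" and "timestamp" keys; the "timestamp" key name and the sample data are assumptions for illustration, and the helper names are hypothetical.

```python
def last_event(task):
    """Return the final entry of a task's taskEvents list."""
    return task["taskEvents"][-1]


def most_recent_first(tasks):
    """Order tasks by the timestamp of their last event, newest first."""
    return sorted(tasks, key=lambda t: last_event(t)["timestamp"], reverse=True)


def status_is_consistent(task):
    """Check statement 3: status equals the status of the last task event."""
    return task["status"] == last_event(task)["status"]


# Made-up sample data shaped like the assumed "inactive" list.
inactive = [
    {"status": "FINISHED", "taskEvents": [
        {"timestamp": 100, "status": "RUNNING"},
        {"timestamp": 200, "status": "FINISHED"},
    ]},
    {"status": "KILLED", "taskEvents": [
        {"timestamp": 150, "status": "KILLING"},
        {"timestamp": 300, "status": "KILLED"},
    ]},
]

for task in most_recent_first(inactive):
    print(task["status"], status_is_consistent(task))
```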
