Re: Making sense of Aurora terminal states

Bill Farner Sat, 21 Feb 2015 11:19:30 -0800

>
> Might I suggest folding this information into the user guide?


You seem like you are now sufficiently equipped to add this doc.  Any
chance you're game to write the doc you wish you had read? :-)

> Just to be absolutely clear on this: KILLING -> LOST will _never_ result in
> a reschedule? What happens if Mesos fails to kill the task and finishes
> running it - will it pass a success message back to Aurora that then gets
> thrown away?


Correct, it will not be rescheduled.  We count on reconciliation to take
care of this.  Currently that's the GC executor, and soon it will be direct
reconciliation with the master.
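
For anyone encoding these rules client-side, here is a rough sketch of the
reschedule logic as discussed in this thread.  It is not taken from
TaskStateMachine.java; the function and constant names are made up for
illustration, and it assumes maxTaskFailures = 1 and a task history given as
a plain list of state names, oldest first.

```python
# Terminal states as listed in statement (4) of this thread.
TERMINAL_STATES = {"LOST", "FINISHED", "FAILED", "KILLED"}

# Passing through any of these leads to a reschedule, per statement (6).
RESCHEDULING_STATES = {"RESTARTING", "DRAINING", "PREEMPTING"}


def is_terminal(history):
    """True if the most recent state in the history is a terminal state."""
    return bool(history) and history[-1] in TERMINAL_STATES


def will_be_rescheduled(history):
    """Apply rules (5) and (6), assuming maxTaskFailures = 1.

    `history` is a hypothetical input shape: state names, oldest first.
    """
    if not is_terminal(history):
        # Task is still active, so there is no verdict yet.
        return False
    if any(state in RESCHEDULING_STATES for state in history):
        return True
    # LOST is rescheduled unless the task went through KILLING first.
    return history[-1] == "LOST" and "KILLING" not in history
```

With this, a history ending in RUNNING -> KILLING -> LOST is reported as not
rescheduled, while RUNNING -> LOST is.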


> Also (sorry for repeated messages), what's the deal with KILLING ->
> [FINISHED, FAILED]? User sends kill request but Mesos reports it's done
> before it gets through so congratulations, you get to keep it?


Correct, this would usually indicate a race between kill and task exit.
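
To spot that race after the fact, one option is to look for KILLING in the
event history of a task that ended in FINISHED or FAILED.  A minimal sketch,
assuming each task is a dict with a "status" key and a "taskEvents" list whose
items carry their own "status" (the key inside each event and the helper name
are assumptions, not a documented schema):

```python
def kill_raced_with_exit(task):
    """Return True if a kill was requested but the task exited on its own.

    `task` is assumed to look like one entry of the "inactive" list:
    {"status": "...", "taskEvents": [{"status": "..."}, ...]}.
    """
    history = [event["status"] for event in task.get("taskEvents", [])]
    return "KILLING" in history and task.get("status") in ("FINISHED", "FAILED")


# Hypothetical sample data, shaped per the assumptions above.
sample_task = {
    "status": "FINISHED",
    "taskEvents": [
        {"status": "RUNNING"},
        {"status": "KILLING"},
        {"status": "FINISHED"},
    ],
}
print(kill_raced_with_exit(sample_task))
```

A task that ends in KILLED, or that never saw KILLING, is not flagged.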




-=Bill

On Fri, Feb 20, 2015 at 1:11 PM, Hussein Elgridly <
huss...@broadinstitute.org> wrote:

> Also (sorry for repeated messages), what's the deal with KILLING ->
> [FINISHED, FAILED]? User sends kill request but Mesos reports it's done
> before it gets through so congratulations, you get to keep it?
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
> On 20 February 2015 at 14:18, Hussein Elgridly <huss...@broadinstitute.org> wrote:
>
> > >> 5. A job in the LOST state will always be rescheduled unless it went
> > >> through KILLING first. (What does this represent - killed by user and
> > >> then lost connectivity to the slave?)
> > >>
> >
> > > True.  That is one way it could happen; it could also happen if the
> > > scheduler times the task out while waiting to hear back from Mesos
> > > after attempting to kill the task.
> >
> > Just to be absolutely clear on this: KILLING -> LOST will _never_ result
> > in a reschedule? What happens if Mesos fails to kill the task and
> > finishes running it - will it pass a success message back to Aurora that
> > then gets thrown away?
> >
> > Hussein Elgridly
> > Senior Software Engineer, DSDE
> > The Broad Institute of MIT and Harvard
> >
> >
> > On 20 February 2015 at 11:08, Hussein Elgridly <huss...@broadinstitute.org> wrote:
> >
> >> This is fantastic (and I'm glad that my understanding was mostly
> >> correct) - thanks a lot.
> >>
> >> Might I suggest folding this information into the user guide? Maybe it's
> >> only relevant for my use case, but I feel like "tasks in terminal states
> >> might be cloned and rescheduled; here's when that might happen" isn't
> >> made as explicit as it could be. I know I'd have had an easier time if
> >> there had been an explanation of "here's what each state means and what
> >> might happen next", and I can imagine [weasel words; citation needed]
> >> that other users might also find this useful.
> >>
> >> Hussein Elgridly
> >> Senior Software Engineer, DSDE
> >> The Broad Institute of MIT and Harvard
> >>
> >>
> >> On 19 February 2015 at 17:35, Bill Farner <wfar...@apache.org> wrote:
> >>
> >>> On Thu, Feb 19, 2015 at 1:27 PM, Hussein Elgridly <
> >>> huss...@broadinstitute.org> wrote:
> >>>
> >>> > I've just spent the afternoon making a flowchart out of
> >>> > TaskStateMachine.java in an attempt to figure out what Aurora states
> >>> > actually mean. Given that all the jobs I submit have unique names
> >>> > and I don't permit retries, I would like to put together a set of
> >>> > rules that determine whether a job is _really_ terminal and
> >>> > definitely won't be rescheduled.
> >>> >
> >>> > Would one of the Aurora devs be willing to play a game of True or
> >>> > False with the following statements?
> >>> >
> >>> > 1. If all my job names are unique and I do an aurora job status
> >>> > --write-json, there will be at most one element in the "active" list.
> >>> >
> >>>
> >>> True iff the job has only one instance.
> >>>
> >>>
> >>> > 2. Jobs in the "inactive" list are ordered by last update time, most
> >>> > recent first.
> >>> >
> >>>
> >>> False.  They are sorted by instance ID [1], which doesn't make much
> >>> sense.
> >>>
> >>> [1]
> >>>
> >>>
> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/src/main/python/apache/aurora/client/cli/jobs.py#L635-L636
> >>>
> >>>
> >>> > 3. A job's "status" will always equal the status of the last item in
> >>> > its "taskEvents" list.
> >>> >
> >>>
> >>> True.
> >>>
> >>>
> >>> > 4. The full list of terminal states is [LOST, FINISHED, FAILED,
> >>> > KILLED]. A job that is not in one of these states will undergo more
> >>> > transitions and will remain in the "active" list until it gets to one
> >>> > of these states. (Will I ever see DELETED, or do they not show up in
> >>> > aurora job status?)
> >>> >
> >>>
> >>> True.  Source of truth is [1].  We actually don't have a state [2] for
> >>> DELETED.
> >>>
> >>> [1]
> >>>
> >>>
> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L410-L413
> >>> [2]
> >>>
> >>>
> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L348-L380
> >>>
> >>>
> >>> > 5. A job in the LOST state will always be rescheduled unless it went
> >>> > through KILLING first. (What does this represent - killed by user and
> >>> > then lost connectivity to the slave?)
> >>> >
> >>>
> >>> True.  That is one way it could happen; it could also happen if the
> >>> scheduler times the task out while waiting to hear back from Mesos
> >>> after attempting to kill the task.
> >>>
> >>>
> >>> > 6. A job will be rescheduled if it goes through one of [RESTARTING,
> >>> > DRAINING, PREEMPTING].
> >>> >
> >>>
> >>> True.
> >>>
> >>>
> >>> > 7. Assuming maxTaskFailures = 1, #5 and #6 are the ONLY situations in
> >>> > which a job will be rescheduled.
> >>> >
> >>>
> >>> True.
> >>>
> >>>
> >>> > 8. These rules are unlikely to change in the future ;)
> >>> >
> >>>
> >>> True, though we could add more states, which would invalidate (4) and
> >>> (6).  In practice, we have changed the states and their meanings very
> >>> little in ~5 years.
> >>>
> >>>
> >>> > Finally, I noticed something odd: ASSIGNED -> LOST has followups
> >>> > [KILL, RESCHEDULE], but STARTING and RUNNING -> LOST only has
> >>> > [RESCHEDULE] as a followup. Why?
> >>> >
> >>>
> >>> This is because ASSIGNED -> LOST may mean that there was a race between
> >>> creating the task and Aurora timing out the launch (it may not have
> >>> heard back from Mesos).  To reduce the likelihood of a redundant
> >>> instance, we try to proactively kill the task.  The RUNNING state does
> >>> not time out, so we do not have the same concern there.
> >>>
> >>>
> >>> > Thanks,
> >>> > Hussein Elgridly
> >>> > Senior Software Engineer, DSDE
> >>> > The Broad Institute of MIT and Harvard
> >>> >
> >>>
> >>
> >>
> >
>
