Handling of aurora update when job/task cannot be scheduled

Anindya Sinha Thu, 15 May 2014 04:51:03 -0700

Hi

I would like to propose a change to how aurora update handles a job or
task that cannot be scheduled immediately. The proposal is based on my
understanding of job scheduling within aurora.
Please feel free to share your comments and/or concerns.


Thanks
Anindya

*Scenario*
Assume we have a job with 2 RUNNING instances (say instances 0 and 1) in
the cluster, and "aurora update" is then issued on the same job key with
the instance count raised to, say, 5. By default, the update leaves
instances 0 and 1 intact and attempts to launch 3 additional instances.
For each new instance, it waits UpdateConfig.watch_secs for the instance
to be in RUNNING state before moving on to the next one.

Assume the cluster is in a state where only 1 additional instance can be
launched due to resource unavailability. Instance 2 is therefore launched
(it reaches RUNNING state), while instance 3 moves to PENDING state. When
UpdateConfig.restart_threshold expires, the update deems instance 3 a
failed instance.
If UpdateConfig.rollback_on_failure is True (the default), the update rolls
back its changes and terminates instances 2 and 3.
If UpdateConfig.rollback_on_failure is False, the update does nothing
further, leaving instances 0 through 2 in RUNNING and instance 3 in
PENDING. In both cases, instance 4 is never attempted.

*Proposal*
I propose that aurora update should NOT treat an instance that is still in
PENDING state after the UpdateConfig.restart_threshold timeout as a
failure, and should leave it in PENDING state. The reasoning is that
instances which could not be scheduled at the time of the update can still
be scheduled later, once a host in the cluster has enough resources
available to run them.

In the current approach, scheduling of instance 4 is not even attempted
because instance 3 is considered a failure. Further, scheduling within
aurora update should ideally be treated like aurora create: an aurora
create with instance count=5 on a cluster in a similar state would leave
3 instances RUNNING and 2 instances PENDING.
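
To make the comparison with aurora create concrete, here is a minimal
sketch of the proposed behaviour next to a toy model of create. Both
functions and their slot parameters are hypothetical and exist only for
this illustration; they are not Aurora APIs.

```python
RUNNING, PENDING = "RUNNING", "PENDING"


def simulate_proposed_update(existing, target, free_slots):
    """Proposed behaviour: a PENDING instance is not a failure (hypothetical helper)."""
    states = {i: RUNNING for i in range(existing)}
    for instance in range(existing, target):
        if free_slots > 0:
            free_slots -= 1
            states[instance] = RUNNING
        else:
            # Leave the instance PENDING and keep going with the next one.
            states[instance] = PENDING
    return states


def simulate_create(count, total_slots):
    """Toy model of aurora create on a cluster that can run total_slots instances."""
    return {i: RUNNING if i < total_slots else PENDING for i in range(count)}


# Update from 2 to 5 instances with room for 1 more, versus creating 5
# instances on a cluster with room for 3 in total.
print(simulate_proposed_update(2, 5, 1))
print(simulate_create(5, 3))
```

Under these assumptions both calls end up with instances 0 through 2 in
RUNNING and instances 3 and 4 in PENDING.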

Setting UpdateConfig.rollback_on_failure=False does not fully address this
use case because:
a) It works only if the PENDING instance is the last instance to be
launched; it fails if there are additional instances to be launched (as in
the example above).
b) It disables rollback altogether, which may not be desirable for "real"
failures to launch tasks in the cluster.
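
One way to picture the distinction in (b) is a check that ignores PENDING
but still honours rollback_on_failure for everything else. This is a rough
sketch under my own assumptions: should_rollback is a hypothetical helper,
and treating every state other than RUNNING or PENDING at the timeout as a
"real" failure is a simplification for illustration.

```python
def should_rollback(state_at_timeout, rollback_on_failure=True):
    """Decide whether an instance's state at restart_threshold triggers rollback.

    Hypothetical helper: under the proposal, PENDING means "not scheduled
    yet" rather than "failed", so it never triggers a rollback.
    """
    if state_at_timeout in ("RUNNING", "PENDING"):
        return False
    # Any other state at the timeout is treated as a real failure here
    # (an assumption), so the rollback setting still applies.
    return rollback_on_failure


print(should_rollback("PENDING"))        # False
print(should_rollback("FAILED"))         # True
print(should_rollback("FAILED", False))  # False
```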

Here is the JIRA for this issue (it contains the same details as this
email):
Reference: https://issues.apache.org/jira/browse/AURORA-413
