Bill Farner created AURORA-121:
----------------------------------

             Summary: Make the preemptor more efficient
                 Key: AURORA-121
                 URL: https://issues.apache.org/jira/browse/AURORA-121
             Project: Aurora
          Issue Type: Story
          Components: Scheduler
            Reporter: Bill Farner


When {{TaskSchedulerImpl}} fails to find an open slot for a task, it falls back 
to the preemptor:

{code}
if (!offerQueue.launchFirst(getAssignerFunction(taskId, task))) {
  // Task could not be scheduled.
  maybePreemptFor(taskId);
  return TaskSchedulerResult.TRY_AGAIN;
}
{code}

This fallback becomes expensive when the task store is large (on the order of
10k tasks) and there is a steady supply of PENDING tasks that open slots cannot
satisfy, because every failed scheduling attempt invokes the preemptor.  The
symptom is an overall degraded/slow scheduler, along with log entries for the
slow queries used for preemption:
{noformat}
I0125 17:47:36.970 THREAD23 org.apache.aurora.scheduler.storage.mem.MemTaskStore.fetchTasks: Query took 107 ms: TaskQuery(owner:null, environment:null, jobName:null, taskIds:null, statuses:[KILLING, ASSIGNED, STARTING, RUNNING, RESTARTING], slaveHost:null, instanceIds:null)
{noformat}

Several approaches come to mind to improve this situation:
- (easy) More aggressively back off on tasks that cannot be satisfied
- (easy) Fall back to preemption less frequently
- (harder) Scan for preemption candidates asynchronously, freeing up the 
TaskScheduler thread and the storage write lock.  Scans could be kicked off by 
the task scheduler, ideally in a way that doesn't dogpile.  This could also be 
done in a weakly-consistent way to minimally contribute to storage contention.
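
As a rough illustration of the asynchronous option, here is a minimal sketch of
a scan trigger that does not dogpile.  {{CoalescingScanner}} and
{{requestScan}} are hypothetical names, not existing Aurora classes, and the
scan itself is left as an opaque {{Runnable}}; the only point is that repeated
requests collapse into at most one queued scan.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical helper (not part of Aurora): coalesces requests for a
 * preemption scan so that at most one scan is queued at any time.
 */
final class CoalescingScanner {
  private final Executor executor;
  private final Runnable scan;
  private final AtomicBoolean scanQueued = new AtomicBoolean(false);

  CoalescingScanner(Executor executor, Runnable scan) {
    this.executor = executor;
    this.scan = scan;
  }

  /**
   * Requests a scan without blocking the caller.
   *
   * @return true if a new scan was queued, false if one was already queued.
   */
  boolean requestScan() {
    if (!scanQueued.compareAndSet(false, true)) {
      // A scan is already waiting to run; it will cover this request too.
      return false;
    }
    try {
      executor.execute(() -> {
        // Clear the flag before scanning so that a request arriving while
        // the scan is in progress queues exactly one follow-up scan.
        scanQueued.set(false);
        scan.run();
      });
    } catch (RuntimeException e) {
      // The executor rejected the work; allow later requests to retry.
      scanQueued.set(false);
      throw e;
    }
    return true;
  }
}
```

Under this sketch, the fallback branch in the task scheduler would call
{{requestScan()}} instead of scanning inline, and the executor would be a
single background thread so that scans never overlap.  Whether the scan reads
from storage in a weakly-consistent way is a separate decision that this
sketch does not address.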



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
