Gyula Fora created FLINK-34318:
----------------------------------

             Summary: AdaptiveScheduler resource stabilisation should happen 
before the job is cancelled
                 Key: FLINK-34318
                 URL: https://issues.apache.org/jira/browse/FLINK-34318
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
            Reporter: Gyula Fora


When a new resource requirement that increases the resource upper bound (max 
TaskManagers) is submitted to the AdaptiveScheduler, the job is cancelled 
immediately, as soon as the first TaskManager comes up.

Once the job is cancelled, the scheduler waits for the entire stabilisation 
period to pass if it cannot acquire all resources, and only then restarts the 
job with the lower-than-requested parallelism.

The problem here is that waiting for resource stabilisation happens after the 
job is cancelled, which introduces unnecessary downtime for the job if the 
stabilisation period is long.

We should change the logic here to wait for the stabilisation period first, to 
acquire all possible resources, before cancelling the job.
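
To illustrate the difference in downtime, here is a minimal toy model of the 
two orderings. It is only a sketch and not Flink code: the class 
RescaleDowntimeSketch, its method names, and the restart-time parameter are 
hypothetical, and it assumes downtime is simply the time between cancellation 
and the job running again.

```java
import java.time.Duration;

/** Toy model of rescale downtime. Hypothetical names; this is not the AdaptiveScheduler. */
public final class RescaleDowntimeSketch {

    private RescaleDowntimeSketch() {}

    /**
     * Ordering described in this issue as the current behaviour: cancel first, then wait.
     * After cancellation the scheduler waits until either all requested resources
     * have arrived or the stabilisation period has elapsed, whichever comes first,
     * and then restarts the job.
     */
    public static Duration downtimeCancelFirst(
            Duration stabilisationPeriod, Duration allResourcesAvailableAfter, Duration restartTime) {
        Duration wait = allResourcesAvailableAfter.compareTo(stabilisationPeriod) < 0
                ? allResourcesAvailableAfter
                : stabilisationPeriod;
        return wait.plus(restartTime);
    }

    /**
     * Proposed ordering: wait for stabilisation while the job keeps running,
     * then cancel and restart. Only the restart itself counts as downtime.
     */
    public static Duration downtimeStabiliseFirst(Duration restartTime) {
        return restartTime;
    }

    public static void main(String[] args) {
        // Assumed example values: 60s stabilisation period, the full set of
        // resources only after 10 minutes, and a 5s restart.
        Duration stabilisation = Duration.ofSeconds(60);
        Duration allResourcesAfter = Duration.ofMinutes(10);
        Duration restart = Duration.ofSeconds(5);

        System.out.println("cancel-first downtime:    "
                + downtimeCancelFirst(stabilisation, allResourcesAfter, restart).getSeconds() + "s");
        System.out.println("stabilise-first downtime: "
                + downtimeStabiliseFirst(restart).getSeconds() + "s");
    }
}
```

With these assumed values the cancel-first ordering yields 65s of downtime and 
the stabilise-first ordering 5s; the gap grows with the stabilisation period.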



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
