Gyula Fora created FLINK-34318:
----------------------------------

             Summary: AdaptiveScheduler resource stabilisation should happen before the job is cancelled
                 Key: FLINK-34318
                 URL: https://issues.apache.org/jira/browse/FLINK-34318
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
            Reporter: Gyula Fora

When a new resource requirement that raises the resource upper bound (max TaskManagers) is submitted to the AdaptiveScheduler, the job is cancelled as soon as the first TaskManager comes up. If the scheduler cannot acquire all requested resources, it then waits for the entire stabilisation period to pass before restarting the job with a lower-than-requested parallelism.

The problem is that the wait for resource stabilisation happens after the job has been cancelled, which introduces unnecessary downtime when the stabilisation period is long.

We should change the logic to wait for the stabilisation period first, acquiring all possible resources, and only then cancel the job.
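
To illustrate the difference in downtime, here is a minimal timeline model of the two orderings described above. It is only a sketch and not the AdaptiveScheduler's actual code: the names (Outcome, cancel_then_wait, wait_then_cancel, restart_cost) are hypothetical, and it assumes the stabilisation timer starts when the first TaskManager arrives and that waiting ends early once all wanted TaskManagers are available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    cancel_at: float     # time at which the running job is cancelled
    restart_at: float    # time at which the job is running again
    task_managers: int   # TaskManagers available at restart

    @property
    def downtime(self) -> float:
        return self.restart_at - self.cancel_at


def _decision_time(arrivals, wanted, stabilisation):
    """Time at which the scheduler stops waiting for more resources.

    Assumption: the stabilisation timer starts at the first arrival.
    """
    if not arrivals:
        raise ValueError("at least one TaskManager must arrive")
    deadline = arrivals[0] + stabilisation
    # Stop early if every wanted TaskManager shows up before the deadline.
    if len(arrivals) >= wanted and arrivals[wanted - 1] <= deadline:
        return arrivals[wanted - 1]
    return deadline


def _available(arrivals, at):
    return sum(1 for t in arrivals if t <= at)


def cancel_then_wait(arrivals, wanted, stabilisation, restart_cost=0.0):
    """Behaviour described in the issue: cancel on first arrival, then wait."""
    arrivals = sorted(arrivals)
    ready = _decision_time(arrivals, wanted, stabilisation)
    return Outcome(arrivals[0], ready + restart_cost, _available(arrivals, ready))


def wait_then_cancel(arrivals, wanted, stabilisation, restart_cost=0.0):
    """Proposed behaviour: keep the job running while waiting, then cancel."""
    arrivals = sorted(arrivals)
    ready = _decision_time(arrivals, wanted, stabilisation)
    return Outcome(ready, ready + restart_cost, _available(arrivals, ready))


if __name__ == "__main__":
    # Three TaskManagers wanted; the third one arrives too late (t=300s).
    arrivals, wanted, stabilisation = [5.0, 20.0, 300.0], 3, 60.0
    print(cancel_then_wait(arrivals, wanted, stabilisation).downtime)
    print(wait_then_cancel(arrivals, wanted, stabilisation).downtime)
```

In this model both orderings restart the job at the same time and with the same number of TaskManagers. The only difference is when the job is cancelled, so the downtime of the proposed ordering shrinks to the restart cost alone.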