[ https://issues.apache.org/jira/browse/FLINK-34318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812600#comment-17812600 ]
Gyula Fora commented on FLINK-34318:
------------------------------------

cc [~dmvk] [~chesnay] [~mxm] what do you guys think?

> AdaptiveScheduler resource stabilisation should happen before the job is cancelled
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-34318
>                 URL: https://issues.apache.org/jira/browse/FLINK-34318
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Gyula Fora
>            Priority: Major
>
> When a new resource requirement is submitted to the AdaptiveScheduler which
> increases the resource upper bound (max taskmanagers), when the first
> TaskManager comes up the job is immediately cancelled.
> Once the job is cancelled the scheduler waits for the entire stabilisation
> period to pass if it cannot acquire all resources before starting with the
> lower-than-requested parallelism.
> The problem here is that waiting for resource stabilisation happens after the
> job is cancelled, introducing unnecessary downtime for the job if the
> stabilisation period is large.
> We should change logic here to wait for the stabilisation period first to
> acquire all possible resources before cancelling the job.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
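
To make the downtime argument in the issue description concrete, here is a minimal sketch that compares the two orderings on a toy timeline. It is not Flink's AdaptiveScheduler code: the function names, the `Outcome` type, and the timing model are all hypothetical. It assumes the stabilisation timer starts when the first new slot arrives and that the restart itself takes no time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outcome:
    cancel_time: float       # when the running job is cancelled
    restart_time: float      # when the job starts again with the new parallelism
    slots_at_restart: int    # how many of the new slots were available at restart

    @property
    def downtime(self) -> float:
        return self.restart_time - self.cancel_time


def _decision_time(arrivals: List[float], desired_new: int, stabilisation: float) -> float:
    """Time at which the scheduler stops waiting for more resources.

    Assumption: the timer starts at the first arrival; waiting ends as soon as
    all desired slots are present, otherwise when the timer expires.
    """
    deadline = arrivals[0] + stabilisation
    if len(arrivals) >= desired_new and arrivals[desired_new - 1] <= deadline:
        return arrivals[desired_new - 1]
    return deadline


def _slots_at(arrivals: List[float], t: float) -> int:
    return sum(1 for a in arrivals if a <= t)


def cancel_then_wait(arrivals: List[float], desired_new: int, stabilisation: float) -> Outcome:
    """Behaviour described in the issue: cancel on first arrival, then wait."""
    arrivals = sorted(arrivals)
    restart = _decision_time(arrivals, desired_new, stabilisation)
    return Outcome(arrivals[0], restart, _slots_at(arrivals, restart))


def wait_then_cancel(arrivals: List[float], desired_new: int, stabilisation: float) -> Outcome:
    """Proposed ordering: keep the job running while waiting, then cancel and restart."""
    arrivals = sorted(arrivals)
    restart = _decision_time(arrivals, desired_new, stabilisation)
    return Outcome(restart, restart, _slots_at(arrivals, restart))


if __name__ == "__main__":
    # Three new slots requested; the third one arrives after the timer expires.
    timeline = [10.0, 12.0, 50.0]
    print(cancel_then_wait(timeline, desired_new=3, stabilisation=30.0).downtime)
    print(wait_then_cancel(timeline, desired_new=3, stabilisation=30.0).downtime)
```

In this model both orderings restart at the same moment and with the same number of slots. The only difference is how long the job is down while the scheduler waits, which is the point the issue makes.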