[jira] [Commented] (FLINK-34318) AdaptiveScheduler resource stabilisation should happen before the job is cancelled

Gyula Fora (Jira) Wed, 31 Jan 2024 01:50:19 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-34318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812600#comment-17812600
 ]


Gyula Fora commented on FLINK-34318:
------------------------------------

cc [~dmvk] [~chesnay] [~mxm] 

What do you guys think?

> AdaptiveScheduler resource stabilisation should happen before the job is 
> cancelled
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-34318
>                 URL: https://issues.apache.org/jira/browse/FLINK-34318
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Gyula Fora
>            Priority: Major
>
> When a new resource requirement that raises the resource upper bound (max 
> TaskManagers) is submitted to the AdaptiveScheduler, the job is cancelled 
> as soon as the first TaskManager comes up. 
> Once the job is cancelled, the scheduler waits for the entire stabilisation 
> period to pass if it cannot acquire all resources, and only then starts the 
> job with the lower-than-requested parallelism.
> The problem is that the wait for resource stabilisation happens after the 
> job is cancelled, which introduces unnecessary downtime for the job if the 
> stabilisation period is long.
> We should change the logic to wait for the stabilisation period first, so 
> that all possible resources are acquired before the job is cancelled.
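
To make the downtime difference described above concrete, here is a rough
sketch in Java. It is not Flink code and does not use any AdaptiveScheduler
API: the class, the method names, and the fixed restart cost are all
assumptions made up for illustration. It only models the two orderings
(cancel first, then wait; or wait first, then cancel) as simple arithmetic
over seconds.

```java
// Hypothetical model, NOT part of Flink: compares job downtime for the two
// orderings discussed in this issue. All names and parameters are assumptions.
final class StabilisationDowntimeSketch {

    private StabilisationDowntimeSketch() {}

    /**
     * Ordering described as the current behaviour: the job is cancelled when
     * the first TaskManager arrives (t = 0), then the scheduler waits until
     * either all resources have arrived or the stabilisation period has
     * passed, and only then restarts the job.
     */
    static long downtimeCancelFirst(
            long stabilisationSeconds,
            long allResourcesAfterSeconds,
            long restartSeconds) {
        // The wait ends at whichever comes first: all resources or the timeout.
        long waitWhileCancelled =
                Math.min(stabilisationSeconds, allResourcesAfterSeconds);
        return waitWhileCancelled + restartSeconds;
    }

    /**
     * Proposed ordering: the job keeps running during the stabilisation wait
     * and is cancelled only afterwards, so the downtime is just the restart.
     */
    static long downtimeWaitFirst(long restartSeconds) {
        return restartSeconds;
    }
}
```

With an assumed 60 second stabilisation period, a 5 second restart, and
resources that never fully arrive within the period, the first method gives
65 seconds of downtime and the second gives 5. The model ignores everything
else a real rescale involves (checkpoint restore time, slot allocation, and
so on).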



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
