[ https://issues.apache.org/jira/browse/FLINK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295566#comment-17295566 ]
Robert Metzger commented on FLINK-21136:
----------------------------------------

Getting the timeout behavior of the adaptive scheduler right for reactive mode seems to be a delicate topic. Before commencing with the implementation, I therefore lay out my thinking.

Requirements:
- As a user of Reactive Mode, I usually first launch the JobManager, and then one or more TaskManagers. Since I do this manually, there might be a delay of up to a few minutes between the JM and the TM start.
- As a user of Reactive Mode, I want to configure the timeout behavior: wait indefinitely for TaskManagers to connect, or wait for x seconds. The default should be "indefinitely".

Changes:
- {{WaitingForResources.notifyNewResourcesAvailable()}} currently calls "Context.hasEnoughResources()", which "checks whether we have enough resources to fulfill the desired resources." In Reactive Mode, we usually never have enough resources to fulfill the desired resources. This means that in Reactive Mode, we always have to wait for the hardcoded 10 seconds resource timeout before anything happens. Therefore, we rename the method and change its semantics to "can we execute with the given resources". Alternatively, we make the behavior pluggable through a controller ("MinimumInitialResourcesController").

Further notes:
- The Executing state reacts immediately to new resources (depending on the ScaleUpController).
- Most likely, in production settings, users will want some configurable "cooldown phase". However, for the sake of keeping the first version simple, this should be sufficient.
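
To make the proposed change in semantics concrete, here is a minimal sketch in plain Java. It is not the Flink implementation: the class {{WaitingForResourcesSketch}}, the {{Decision}} enum, the slot counts and the method {{canExecuteWith}} are hypothetical names chosen for illustration. Only the idea of replacing a "desired resources" check with a "can we execute" check, plus an optional timeout that defaults to waiting indefinitely, comes from the comment above. What should happen once the timeout fires is deliberately left open.

```java
import java.time.Duration;
import java.util.Optional;

/** Illustrative sketch only; none of these names are Flink APIs. */
final class WaitingForResourcesSketch {

    /** Outcome of re-evaluating the waiting state (hypothetical). */
    public enum Decision { START_JOB, KEEP_WAITING, TIMED_OUT }

    private final int desiredSlots;                   // assumed to be very large in reactive mode
    private final int minimumSlots;                   // assumed smallest slot count the job can run with
    private final Optional<Duration> resourceTimeout; // empty means "wait indefinitely"

    public WaitingForResourcesSketch(
            int desiredSlots, int minimumSlots, Optional<Duration> resourceTimeout) {
        this.desiredSlots = desiredSlots;
        this.minimumSlots = minimumSlots;
        this.resourceTimeout = resourceTimeout;
    }

    /** Old semantics: true only if the full desired resources are available. */
    public boolean hasEnoughResources(int availableSlots) {
        return availableSlots >= desiredSlots;
    }

    /** Proposed semantics: true as soon as the job can execute at all. */
    public boolean canExecuteWith(int availableSlots) {
        return availableSlots >= minimumSlots;
    }

    /** Decide what to do given the currently available slots and the time waited so far. */
    public Decision decide(int availableSlots, Duration waited) {
        if (canExecuteWith(availableSlots)) {
            return Decision.START_JOB;
        }
        // Without a configured timeout, this branch is never taken.
        if (resourceTimeout.isPresent() && waited.compareTo(resourceTimeout.get()) >= 0) {
            return Decision.TIMED_OUT;
        }
        return Decision.KEEP_WAITING;
    }
}
```

With a desired slot count of 128 and four slots available, {{hasEnoughResources(4)}} is false, so the old check would sit out the timeout, while {{decide(4, Duration.ZERO)}} starts the job right away. With no slots and an empty timeout, {{decide}} keeps waiting no matter how much time has passed.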

> Reactive Mode: Adjust timeout behavior in adaptive scheduler
> ------------------------------------------------------------
>
> Key: FLINK-21136
> URL: https://issues.apache.org/jira/browse/FLINK-21136
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Reporter: Robert Metzger
> Assignee: Robert Metzger
> Priority: Major
> Fix For: 1.13.0
>
>
> The FLIP states the following timeout and resource registration behavior:
> On initial startup, the declarative scheduler will wait indefinitely for
> TaskManagers to show up. Once there are enough TaskManagers available to
> start the job, and the set of resources is stable (see FLIP-160 for a
> definition), the job will start running.
> Once the job has started running, and a TaskManager is lost, it will wait for
> 10 seconds for the TaskManager to re-appear. Otherwise, the job will be
> scheduled again with the available resources. If no TaskManagers are
> available anymore, the declarative scheduler will wait indefinitely again for
> new resources.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)