[jira] [Commented] (FLINK-21136) Reactive Mode: Adjust timeout behavior in adaptive scheduler

Robert Metzger (Jira) Thu, 04 Mar 2021 12:07:17 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295566#comment-17295566
 ]


Robert Metzger commented on FLINK-21136:
----------------------------------------

Getting the timeout behavior of the adaptive scheduler right for Reactive Mode 
seems to be a delicate topic. Before starting the implementation, I therefore 
want to lay out my thinking.

Requirements:
- As a user of Reactive Mode, I usually launch the JobManager first, and then 
one or more TaskManagers. Since I do this manually, there might be a delay of 
up to a few minutes between the JM and the TM start.
- As a user of Reactive Mode, I want to configure how long the scheduler waits 
for resources: either wait indefinitely for TaskManagers to connect, or wait 
for x seconds. The default should be "indefinitely".
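
To illustrate the second requirement, here is a minimal sketch of a wait 
timeout that is either "indefinitely" or "x seconds". All names 
(ResourceWaitTimeoutSketch, hasTimedOut) are made up for this comment and are 
not existing Flink classes or config options; an empty Optional stands in for 
"wait indefinitely".

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical sketch, not Flink code: models "wait indefinitely" vs "wait for x seconds". */
final class ResourceWaitTimeoutSketch {

    /**
     * Returns true if we have waited at least as long as the configured timeout.
     * An empty timeout means "wait indefinitely", so it never times out.
     */
    public static boolean hasTimedOut(Duration waited, Optional<Duration> timeout) {
        return timeout.map(t -> waited.compareTo(t) >= 0).orElse(false);
    }

    public static void main(String[] args) {
        Optional<Duration> indefinitely = Optional.empty();       // assumed default
        Optional<Duration> tenSeconds = Optional.of(Duration.ofSeconds(10));

        // A TaskManager started by hand a few minutes after the JobManager:
        Duration waited = Duration.ofMinutes(3);

        System.out.println(hasTimedOut(waited, indefinitely));
        System.out.println(hasTimedOut(waited, tenSeconds));
    }
}
```

With the "indefinitely" default, a TaskManager that is started a few minutes 
after the JobManager can still connect; with a finite timeout the wait would 
already have expired.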

Changes:
- {{WaitingForResources.notifyNewResourcesAvailable()}} currently calls 
{{Context.hasEnoughResources()}}, which "checks whether we have enough 
resources to fulfill the desired resources.". In Reactive Mode, we almost 
never have enough resources to fulfill the desired resources. This means that 
in Reactive Mode, we always have to wait for the hardcoded 10-second resource 
timeout before anything happens. Therefore, we rename the method and change 
its semantics to "can we execute with the given resources". Alternatively, we 
make the behavior pluggable through a controller 
("MinimumInitialResourcesController").
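
The difference between the two semantics can be sketched roughly as follows. 
This is an illustration only: the class, the method signatures and the 
slot-count model are assumptions made for this comment, not the actual 
{{Context}} interface, and Integer.MAX_VALUE merely stands in for "as many 
resources as possible".

```java
/** Hypothetical sketch, not Flink code: contrasts the current and the proposed check. */
final class ResourceCheckSketch {

    /** Current semantics: all desired slots must be available. */
    public static boolean hasEnoughResources(int availableSlots, int desiredSlots) {
        return availableSlots >= desiredSlots;
    }

    /** Proposed semantics: can we execute at all with the given resources? */
    public static boolean canExecuteWith(int availableSlots, int minimumSlots) {
        return availableSlots >= minimumSlots;
    }

    public static void main(String[] args) {
        int desiredSlots = Integer.MAX_VALUE; // assumption: Reactive Mode wants as much as it can get
        int minimumSlots = 1;                 // assumption: one slot is enough to run the job
        int availableSlots = 4;               // e.g. one TaskManager with four slots registered

        // With the current check we keep waiting until the resource timeout fires.
        System.out.println(hasEnoughResources(availableSlots, desiredSlots));

        // With the proposed check we can leave WaitingForResources right away.
        System.out.println(canExecuteWith(availableSlots, minimumSlots));
    }
}
```

A pluggable controller would essentially let the second predicate be swapped 
out without touching the state itself.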


Further notes:
- The Executing state reacts immediately to new resources (depending on the 
ScaleUpController).
- Most likely, in production settings, users will want some configurable 
"cooldown phase". However, for the sake of keeping the first version simple, 
reacting immediately should be sufficient.

> Reactive Mode: Adjust timeout behavior in adaptive scheduler
> ------------------------------------------------------------
>
>                 Key: FLINK-21136
>                 URL: https://issues.apache.org/jira/browse/FLINK-21136
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: Robert Metzger
>            Assignee: Robert Metzger
>            Priority: Major
>             Fix For: 1.13.0
>
>
> The FLIP states the following timeout and resource registration behavior: 
> On initial startup, the declarative scheduler will wait indefinitely for 
> TaskManagers to show up. Once there are enough TaskManagers available to 
> start the job, and the set of resources is stable (see FLIP-160 for a 
> definition), the job will start running.
> Once the job has started running, and a TaskManager is lost, it will wait for 
> 10 seconds for the TaskManager to re-appear. Otherwise, the job will be 
> scheduled again with the available resources. If no TaskManagers are 
> available anymore, the declarative scheduler will wait indefinitely again for 
> new resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
