[ https://issues.apache.org/jira/browse/FLINK-18625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166829#comment-17166829 ]
Xintong Song commented on FLINK-18625:
--------------------------------------

[~trohrmann], regarding your questions.

bq. How would this feature work if the job requests heterogeneous slots which might result in differently sized TMs? I guess we will allocate default sized TMs. But what if this will prevent us from allocating fewer larger sized TMs which are required for fulfilling the heterogeneous slot requests?

I see your point. One optimization could be to release the redundant task managers if there are heterogeneous pending worker requests. The problem is that a redundant task manager may not be releasable if any of its slots are allocated (e.g., when slots are evenly spread out), and even if it is releasable, obtaining the new task manager would take additional time. I guess that's the price we need to pay if this feature is enabled. WDYT?

bq. How does this feature relate to FLINK-16605 and FLINK-15959? I believe that the lower and upper bounds should also limit the number of redundant slots, right?

According to [~Jiangang]'s PR, the upper bound also limits the number of redundant slots. I believe it should be the same for the lower bound. We should make sure of that when working on FLINK-15959.

cc [~karmagyz]


> Maintain redundant taskmanagers to speed up failover
> ----------------------------------------------------
>
>                 Key: FLINK-18625
>                 URL: https://issues.apache.org/jira/browse/FLINK-18625
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: Liu
>            Assignee: Liu
>            Priority: Major
>              Labels: pull-request-available
>
> When a Flink job fails because of killed taskmanagers, it requests new
> containers when restarting. Requesting new containers can be very slow,
> sometimes taking dozens of seconds or even more. The reasons vary; for
> example, YARN and HDFS are slow, or machine performance is poor. In some
> production scenarios, the SLA is strict and failover should complete within
> seconds.
>
> To speed up the recovery process, we can maintain redundant slots in advance.
> When the job restarts, it can use the redundant slots at once instead of
> requesting new taskmanagers.
>
> The implementation can be done in SlotManagerImpl. Below is a brief description:
> # In the constructor, initialize redundantTaskmanagerNum from the config.
> # In method start(), allocate the redundant taskmanagers.
> # In method start(), change taskManagerTimeoutCheck() to
> checkValidTaskManagers().
> # In method checkValidTaskManagers(), manage redundant taskmanagers and
> timed-out taskmanagers. The number of idle taskmanagers must not be less than
> redundantTaskmanagerNum.
> * If less, allocate from the resourceManager until equal.
> * If more, release timed-out taskmanagers but keep at least
> redundantTaskmanagerNum idle taskmanagers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
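
For reference, the checkValidTaskManagers() bookkeeping from the issue description could look roughly like the sketch below. This is only an illustration under assumptions, not the code from the PR: the class RedundantTaskManagerSketch, the IdleTaskManager and Decision types, and the timeoutMillis/nowMillis parameters are all hypothetical, and the real SlotManagerImpl would call into the resource manager instead of returning a decision object. Only the names redundantTaskmanagerNum and checkValidTaskManagers come from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, standalone sketch; none of these types exist in Flink.
class RedundantTaskManagerSketch {

    /** Stand-in for an idle task manager and the time it became idle. */
    static final class IdleTaskManager {
        final String id;
        final long idleSinceMillis;

        IdleTaskManager(String id, long idleSinceMillis) {
            this.id = id;
            this.idleSinceMillis = idleSinceMillis;
        }
    }

    /** Result of one check: how many workers to request and which ones to release. */
    static final class Decision {
        final int toAllocate;
        final List<String> toRelease;

        Decision(int toAllocate, List<String> toRelease) {
            this.toAllocate = toAllocate;
            this.toRelease = toRelease;
        }
    }

    static Decision checkValidTaskManagers(
            List<IdleTaskManager> idle,
            int redundantTaskmanagerNum,
            long timeoutMillis,
            long nowMillis) {
        int idleCount = idle.size();

        // Fewer idle task managers than required: request the difference.
        if (idleCount < redundantTaskmanagerNum) {
            return new Decision(redundantTaskmanagerNum - idleCount, new ArrayList<>());
        }

        // Otherwise release timed-out task managers, but never so many that
        // fewer than redundantTaskmanagerNum idle ones remain.
        int releasable = idleCount - redundantTaskmanagerNum;
        List<String> toRelease = new ArrayList<>();
        for (IdleTaskManager tm : idle) {
            if (toRelease.size() >= releasable) {
                break;
            }
            if (nowMillis - tm.idleSinceMillis >= timeoutMillis) {
                toRelease.add(tm.id);
            }
        }
        return new Decision(0, toRelease);
    }

    public static void main(String[] args) {
        List<IdleTaskManager> idle = new ArrayList<>();
        idle.add(new IdleTaskManager("tm-1", 0L));
        idle.add(new IdleTaskManager("tm-2", 50_000L));
        idle.add(new IdleTaskManager("tm-3", 0L));

        Decision d = checkValidTaskManagers(idle, 2, 30_000L, 60_000L);
        System.out.println("allocate=" + d.toAllocate + ", release=" + d.toRelease);
    }
}
```

Note that the sketch counts only fully idle task managers. As discussed in the comment above, a task manager with any allocated slot is not releasable, so it would not appear in the idle list at all.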