[ https://issues.apache.org/jira/browse/FLINK-18625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167202#comment-17167202 ]
Yangze Guo commented on FLINK-18625:
------------------------------------

Liu's PR also looks good to me.

> Maintain redundant taskmanagers to speed up failover
> ----------------------------------------------------
>
>                 Key: FLINK-18625
>                 URL: https://issues.apache.org/jira/browse/FLINK-18625
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: Liu
>            Assignee: Liu
>            Priority: Major
>              Labels: pull-request-available
>
> When a Flink job fails because its taskmanagers were killed, it requests new
> containers when restarting. Requesting new containers can be very slow;
> it sometimes takes dozens of seconds or even more. The reasons vary: for
> example, YARN and HDFS are slow, or machine performance is poor. In some
> production scenarios, the SLA is strict and failover should complete within
> seconds.
>
> To speed up the recovery process, we can maintain redundant slots in advance.
> When the job restarts, it can use the redundant slots immediately instead of
> requesting new taskmanagers.
>
> The implementation can be done in SlotManagerImpl. Below is a brief description:
> # In the constructor, initialize redundantTaskmanagerNum from the config.
> # In method start(), allocate redundant taskmanagers.
> # In method start(), change taskManagerTimeoutCheck() to
> checkValidTaskManagers().
> # In method checkValidTaskManagers(), manage redundant taskmanagers and
> timed-out taskmanagers. The number of idle taskmanagers must not be less than
> redundantTaskmanagerNum.
> * If less, allocate from resourceManager until equal.
> * If more, release timed-out taskmanagers but keep at least
> redundantTaskmanagerNum idle taskmanagers.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
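
For readers following the thread, the checkValidTaskManagers() rule described in the issue can be sketched as a small standalone model. This is only an illustration under stated assumptions, not the code in Liu's PR and not Flink's SlotManagerImpl: the class RedundantTaskManagerSketch, the IdleTaskManager record, addIdle(), idleCount(), and the idleTimeoutMillis parameter are all hypothetical names, and "allocate from resourceManager" is simulated by appending to a list.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model of the proposal; hypothetical, it does not use any Flink API. */
class RedundantTaskManagerSketch {

    /** Hypothetical stand-in for an idle taskmanager: an id and when it became idle. */
    static final class IdleTaskManager {
        final String id;
        final long idleSinceMillis;

        IdleTaskManager(String id, long idleSinceMillis) {
            this.id = id;
            this.idleSinceMillis = idleSinceMillis;
        }
    }

    private final int redundantTaskmanagerNum; // step 1: would come from config
    private final long idleTimeoutMillis;      // assumed idle timeout
    private final List<IdleTaskManager> idle = new ArrayList<>();
    private int allocated = 0;

    RedundantTaskManagerSketch(int redundantTaskmanagerNum, long idleTimeoutMillis) {
        this.redundantTaskmanagerNum = redundantTaskmanagerNum;
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    /** Steps 2 and 3: on start, the periodic check allocates the redundant pool. */
    void start(long nowMillis) {
        checkValidTaskManagers(nowMillis);
    }

    /** Registers a taskmanager that became idle at the given time. */
    void addIdle(String id, long idleSinceMillis) {
        idle.add(new IdleTaskManager(id, idleSinceMillis));
    }

    int idleCount() {
        return idle.size();
    }

    /** Step 4: keep the idle count at or above redundantTaskmanagerNum. */
    void checkValidTaskManagers(long nowMillis) {
        // If less: "allocate" (simulated here) until the idle count equals the target.
        while (idle.size() < redundantTaskmanagerNum) {
            idle.add(new IdleTaskManager("redundant-" + allocated++, nowMillis));
        }
        // If more: release timed-out taskmanagers, but never drop below the target.
        int releasable = idle.size() - redundantTaskmanagerNum;
        Iterator<IdleTaskManager> it = idle.iterator();
        while (releasable > 0 && it.hasNext()) {
            IdleTaskManager tm = it.next();
            if (nowMillis - tm.idleSinceMillis >= idleTimeoutMillis) {
                it.remove();
                releasable--;
            }
        }
    }

    public static void main(String[] args) {
        RedundantTaskManagerSketch manager = new RedundantTaskManagerSketch(2, 1000L);
        manager.start(0L);
        System.out.println(manager.idleCount()); // prints 2
        manager.addIdle("tm-a", 0L);
        manager.addIdle("tm-b", 0L);
        manager.addIdle("tm-c", 1000L);
        manager.checkValidTaskManagers(500L);
        System.out.println(manager.idleCount()); // prints 5 (nothing has timed out yet)
        manager.checkValidTaskManagers(1200L);
        System.out.println(manager.idleCount()); // prints 2 (floor is kept)
    }
}
```

In this sketch the release loop stops as soon as the idle count reaches redundantTaskmanagerNum, so a taskmanager that has timed out may still be kept to preserve the redundant pool. Which timed-out taskmanagers are released first is an arbitrary choice here (list order); a real implementation could choose differently.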