Liu created FLINK-18625:
---------------------------

             Summary: Maintain redundant taskmanagers to speed up failover
                 Key: FLINK-18625
                 URL: https://issues.apache.org/jira/browse/FLINK-18625
             Project: Flink
          Issue Type: New Feature
            Reporter: Liu


When flink job fails because of killed taskmanagers, it will request new 
containers when restarting. Requesting new containers can be very slow, 
sometimes it takes dozens of seconds even more. The reasons can be different, 
for example, yarn and hdfs are slow, machine performance is poor. In some 
product scenario, SLA is high and failover should keep be in seconds.

 

To speed up the recovery process, we can maintain redundant taskmanagers in 
advance. When job restarts, it can use the redundant taskmanagers at once 
instead of requesting new taskmanagers.

 

The implemention can be done in SlotManagerImpl. Below is a brief description:
 # In construct method, init redundantTaskmanagerNum from config.
 # In method start(), allocate redundant taskmanagers.
 # In method start(), Change taskManagerTimeoutCheck() to 
redundantTaskmanagerCheck().
 # In method redundantTaskmanagerCheck(), manage redundant taskmanagers and 
timeout taskmanagers. The idle taskmanager number must be not less than 
redundantTaskmanagerNum.

 * If less, allocate from resourceManager until equal.
 * If more, release timeout taskmanagers but keep at least 
redundantTaskmanagerNum idle taskmanagers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to