[jira] [Comment Edited] (FLINK-18451) Flink HA on yarn may appear TaskManager double running when HA is restored

ming li (Jira) Fri, 31 Jul 2020 20:13:43 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-18451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169221#comment-17169221
 ]


ming li edited comment on FLINK-18451 at 8/1/20, 3:12 AM:
----------------------------------------------------------

In fact, this middleware is provided by another team. At the beginning of the 
design, the user is required to control the management of the consumer's life 
cycle. We originally thought that Flink would guarantee that the task was 
cancelled before restarting the task, until we discovered this problem when we 
tested Flink HA.

If we always wait for a heartbeat timeout before restarting the job to ensure 
that all tasks have ended, it will undoubtedly make the job recovery time 
longer (at least greater than the time of a heartbeat timeout).

Our current preliminary idea is to add a wait process before the JobManager 
restarts the Job (only in the case of JobManager failover). In this wait 
process, the JobManager will receive all task information reported by the 
re-registered TaskManager (just like reporting to Same as the initialSlotReport 
of ResourceManager). Wait only ends in the following two situations:
 1. All task information is re-reported.
 2. Wait for the heartbeat timeout period to be reached.
 Then we restart the Job.

Considering that this kind of rogue TaskExecutors is relatively rare, adding 
this process will not increase too much recovery time for most cases. In the 
case of rogue TaskExecutors, we can only wait for a maximum timeout to ensure 
that the old task is stopped.


was (Author: ming li):
In fact, this middleware is provided by another team. At the beginning of the 
design, the user is required to control the management of the consumer's life 
cycle. We originally thought that Flink would guarantee that the task was 
cancelled before restarting the task, until we discovered this problem when we 
tested Flink HA.

 

If we always wait for a heartbeat timeout before restarting the job to ensure 
that all tasks have ended, it will undoubtedly make the job recovery time 
longer (at least greater than the time of a heartbeat timeout).

 

Our current preliminary idea is to add a wait process before the JobManager 
restarts the Job (only in the case of JobManager failover). In this wait 
process, the JobManager will receive all task information reported by the 
re-registered TaskManager (just like reporting to Same as the initialSlotReport 
of ResourceManager). Wait only ends in the following two situations:
1. All task information is re-reported.
2. Wait for the heartbeat timeout period to be reached.
Then we restart the Job.

 

Considering that this kind of rogue TaskExecutors is relatively rare, adding 
this process will not increase too much recovery time for most cases. In the 
case of rogue TaskExecutors, we can only wait for a maximum timeout to ensure 
that the old task is stopped.

> Flink HA on yarn may appear TaskManager double running when HA is restored
> --------------------------------------------------------------------------
>
>                 Key: FLINK-18451
>                 URL: https://issues.apache.org/jira/browse/FLINK-18451
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.9.0
>            Reporter: ming li
>            Priority: Major
>              Labels: high-availability
>
> We found that when NodeManager is lost, the new JobManager will be restored 
> by Yarn's ResourceManager, and the Leader node will be registered on 
> Zookeeper. The original TaskManager will find the new JobManager through 
> Zookeeper and close the old JobManager connection. At this time, all tasks of 
> the TaskManager will fail. The new JobManager will directly perform job 
> recovery and recover from the latest checkpoint.
> However, during the recovery process, when a TaskManager is abnormally 
> connected to Zookeeper, it is not registered with the new JobManager in time. 
> Before the following timeout:
> 1. Connect with Zookeeper
> 2. Heartbeat with JobManager/ResourceManager
> Task will continue to run (assuming that Task can run independently in 
> TaskManager). Assuming that HA recovers fast enough, some Task double runs 
> will occur at this time.
> Do we need to make a persistent record of the cluster resources we allocated 
> during the runtime, and use it to judge all Task stops when HA is restored?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (FLINK-18451) Flink HA on yarn may appear TaskManager double running when HA is restored

Reply via email to