[ https://issues.apache.org/jira/browse/FLINK-18451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150766#comment-17150766 ]
ming li commented on FLINK-18451:
---------------------------------

Hi, [~trohrmann]. Thanks for your reply. We are currently running tests before putting Flink HA into production, so we simulate various abnormal conditions to verify that HA behaves as expected, and this problem showed up during those tests. In a production environment I am not sure whether this problem has already occurred or has simply gone unnoticed, because it is hard to reproduce and, even when it does occur, the impact is not obvious.

You suggested adjusting the heartbeat timeout to be shorter than the JobManager recovery time. But the shorter the JobManager recovery time, the shorter the heartbeat timeout has to be, and the lower our tolerance for transient network problems becomes.

Whether the job runs with at-least-once or exactly-once semantics, I worry that if the source does not support simultaneous consumption by multiple consumers, duplicate running tasks may even cause data loss. In any case, we should do our best to avoid this kind of duplicate processing, so I think it is necessary to make sure that all of the previous tasks have stopped before the JobManager restores the job.

> Flink HA on yarn may appear TaskManager double running when HA is restored
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-18451
>                 URL: https://issues.apache.org/jira/browse/FLINK-18451
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.9.0
>            Reporter: ming li
>            Priority: Major
>              Labels: high-availability
>
> We found that when a NodeManager is lost, a new JobManager is started by YARN's ResourceManager and registers itself as the leader in ZooKeeper. The original TaskManagers discover the new JobManager through ZooKeeper and close their connections to the old JobManager, at which point all tasks on those TaskManagers fail. The new JobManager then performs job recovery and restores the job from the latest checkpoint.
> However, if a TaskManager's connection to ZooKeeper is abnormal during the recovery process, it does not register with the new JobManager in time. Until the following time out:
> 1. the connection with ZooKeeper
> 2. the heartbeat with the JobManager/ResourceManager
> its tasks will continue to run (assuming a task can run independently inside the TaskManager). If HA recovery is fast enough, some tasks will therefore be running twice at the same time.
> Do we need to keep a persistent record of the cluster resources allocated at runtime, and use it to verify that all previous tasks have stopped when HA is restored?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
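For reference, a minimal flink-conf.yaml sketch of the settings involved in the heartbeat trade-off discussed in the comment above. The keys (heartbeat.interval, heartbeat.timeout, yarn.application-attempts, and the high-availability.* options) are existing Flink configuration options; the concrete values, the ZooKeeper quorum, and the HDFS path below are illustrative assumptions only, not recommendations.

{code:yaml}
# flink-conf.yaml (values are illustrative, not recommendations)

# ZooKeeper-based high availability, so a new JobManager can take over leadership.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha

# Allow YARN to restart the ApplicationMaster / JobManager after a NodeManager is lost.
yarn.application-attempts: 10

# Heartbeat settings (milliseconds). A smaller heartbeat.timeout makes orphaned
# tasks fail sooner once the old JobManager is gone, but it also lowers the
# tolerance for transient network problems, which is the trade-off discussed above.
heartbeat.interval: 10000
heartbeat.timeout: 50000
{code}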