[ 
https://issues.apache.org/jira/browse/FLINK-18451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168898#comment-17168898
 ] 

Till Rohrmann edited comment on FLINK-18451 at 7/31/20, 2:30 PM:
-----------------------------------------------------------------

I understand the problem now [~Ming Li]. I think Flink's Kafka connector solves 
this problem by manually assigning the Kafka partitions. How do other systems 
that interact with your middleware solve this problem? Are you planning to 
extend your middleware to address it from the middleware side?
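
To illustrate what manual assignment means, here is a minimal sketch against the plain Kafka client API. It is the general technique only, not the connector's actual code; the broker address, topic name, partition numbers and offsets are made up. With {{assign()}} instead of {{subscribe()}} there is no consumer group rebalance that could move partitions between readers, and start offsets come from the reader's own state (e.g. a checkpoint) rather than from the group:

{code:java}
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ManualAssignmentExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // illustrative address
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Note: no group.id and no subscribe() -- the broker-side group
        // coordinator is not involved, so no rebalance can reassign partitions.

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Each reader is told explicitly which partitions it owns.
            List<TopicPartition> myPartitions =
                    Arrays.asList(new TopicPartition("events", 0),  // "events" is a made-up topic
                                  new TopicPartition("events", 1));
            consumer.assign(myPartitions);

            // Start positions come from the reader's own restored state,
            // not from offsets committed to a consumer group.
            for (TopicPartition tp : myPartitions) {
                consumer.seek(tp, 0L); // the restored offset would go here
            }

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
{code}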

From Flink's perspective, the best solution I can think of is to wait to 
restart a job until one is sure that all rogue {{TaskExecutors}} have realized 
that they are no longer connected to the cluster and have therefore stopped 
executing their {{Tasks}}.
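
A minimal sketch of that idea (a hypothetical helper, not existing Flink code; the timeout parameters stand in for whatever values the deployment actually configures): delay the restart by at least the longest interval after which a disconnected {{TaskExecutor}} is guaranteed to have noticed the lost connection and cancelled its {{Tasks}}.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DelayedRestartSketch {

    /**
     * Delay the restart of a recovered job until every disconnected
     * TaskExecutor must have noticed the lost heartbeat / leadership
     * and cancelled its running Tasks.
     */
    public static void scheduleRestart(Runnable restartJob,
                                       long heartbeatTimeoutMs,  // TM <-> JM/RM heartbeat timeout
                                       long zkSessionTimeoutMs,  // ZooKeeper session timeout
                                       long safetyMarginMs) {
        // A rogue TaskExecutor keeps running at most until the larger of the
        // two timeouts expires, so waiting that long (plus a margin) before
        // restarting avoids executing the same Task twice.
        long waitMs = Math.max(heartbeatTimeoutMs, zkSessionTimeoutMs) + safetyMarginMs;

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.schedule(() -> {
            try {
                restartJob.run();
            } finally {
                scheduler.shutdown();
            }
        }, waitMs, TimeUnit.MILLISECONDS);
    }
}
{code}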


was (Author: till.rohrmann):
I understand the problem now [~Ming Li]. I think Flink's Kafka connector solves 
this problem by manually assigning the Kafka partitions. How do other systems 
that interact with your middleware solve this problem? Are you planning to 
extend your middleware to solve this problem?

> Flink HA on yarn may appear TaskManager double running when HA is restored
> --------------------------------------------------------------------------
>
>                 Key: FLINK-18451
>                 URL: https://issues.apache.org/jira/browse/FLINK-18451
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.9.0
>            Reporter: ming li
>            Priority: Major
>              Labels: high-availability
>
> We found that when a NodeManager is lost, a new JobManager is started by 
> YARN's ResourceManager and registers itself as the leader in ZooKeeper. The 
> original TaskManagers discover the new JobManager through ZooKeeper and close 
> their connections to the old JobManager. At that point all tasks on those 
> TaskManagers fail, and the new JobManager performs job recovery directly, 
> restoring from the latest checkpoint.
> However, if a TaskManager's connection to ZooKeeper is abnormal during the 
> recovery process, it does not register with the new JobManager in time. Until 
> one of the following timeouts expires:
> 1. the connection to ZooKeeper
> 2. the heartbeat with the JobManager/ResourceManager
> its Tasks will keep running (assuming a Task can run independently in the 
> TaskManager). If HA recovery is fast enough, some Tasks will therefore be 
> running twice at this time.
> Do we need to keep a persistent record of the cluster resources allocated at 
> runtime and use it to verify that all Tasks have stopped when HA is restored?
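
A minimal sketch of the fencing check asked about above (the {{AllocationStore}} interface and its methods are hypothetical, not existing Flink code): the previous JobManager persists the set of TaskManagers it allocated, and the new JobManager refuses to restore the job until each of them has either confirmed that its old Tasks are stopped or its lease has expired.

{code:java}
import java.util.Set;
import java.util.concurrent.TimeUnit;

public class FencedRecoverySketch {

    // Hypothetical store, e.g. one ZooKeeper node per allocated TaskManager.
    interface AllocationStore {
        Set<String> previouslyAllocatedTaskManagers();
        boolean hasConfirmedAllTasksStopped(String taskManagerId);
    }

    /**
     * Block job restoration until every previously allocated TaskManager
     * is known to have stopped its Tasks.
     */
    public static void awaitFencing(AllocationStore store,
                                    long pollIntervalMs) throws InterruptedException {
        for (String tm : store.previouslyAllocatedTaskManagers()) {
            while (!store.hasConfirmedAllTasksStopped(tm)) {
                TimeUnit.MILLISECONDS.sleep(pollIntervalMs);
            }
        }
        // Only now is it safe to restore from the latest checkpoint.
    }
}
{code}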



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
