[ https://issues.apache.org/jira/browse/FLINK-20332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291521#comment-17291521 ]
Till Rohrmann commented on FLINK-20332:
---------------------------------------

I think you are right [~xintongsong]. We cannot assume that the recovered workers are empty. If we want to support graceful JM failover in the future, the TMs must keep their tasks running for a little while. Hence, stopping them immediately on JM disconnection might not be possible.

The only way I can see this working at the moment is to have a second pool of "pending" slots which might or might not be free. If they are not free, the system would have to request more resources. But I am not sure whether the benefits are worth the extra effort. Hence, +1 for not pursuing this ticket for the time being (and closing it as won't do).

> Add workers recovered from previous attempt to pending resources
> ----------------------------------------------------------------
>
>                 Key: FLINK-20332
>                 URL: https://issues.apache.org/jira/browse/FLINK-20332
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: Xintong Song
>            Assignee: Xintong Song
>            Priority: Major
>              Labels: pull-request-available
>
> For active deployments (Native K8s/Yarn/Mesos), after a JM failover, workers
> from the previous attempt should register with the new JM. Depending on the
> order in which slot requests and TM registrations arrive at the RM, it could
> happen that the RM allocates unnecessary new resources while there are
> recovered resources that can be reused.
> A potential improvement is to add recovered workers to pending resources, so
> that the RM knows what resources are expected to be available soon and can
> decide whether to allocate new resources accordingly.
> See also the discussion in FLINK-20249.
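
The "second pool of pending slots" idea from the comment can be illustrated with a minimal Java sketch. This is not Flink code: `RecoveredSlotPool` and all of its methods are hypothetical names invented for this example, and it assumes that each recovered TM reports how many of its slots are actually free when it registers with the new JM.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names belong to Flink's ResourceManager API. */
public class RecoveredSlotPool {
    // Slots of recovered workers that have not yet re-registered.
    // They "might or might not be free", so they are only counted optimistically.
    private final Map<String, Integer> unconfirmedSlots = new HashMap<>();

    // Slots that re-registered workers have reported as free.
    private int confirmedFreeSlots = 0;

    /** Record a worker recovered from the previous attempt as pending. */
    public void addRecoveredWorker(String workerId, int numSlots) {
        unconfirmedSlots.put(workerId, numSlots);
    }

    /** Called when a recovered worker registers and reports its free slots. */
    public void onWorkerRegistered(String workerId, int freeSlots) {
        if (unconfirmedSlots.remove(workerId) != null) {
            confirmedFreeSlots += freeSlots;
        }
    }

    /** Confirmed free slots plus all slots whose state is still unknown. */
    public int optimisticFreeSlots() {
        int total = confirmedFreeSlots;
        for (int slots : unconfirmedSlots.values()) {
            total += slots;
        }
        return total;
    }

    /** Number of additional slots to request for the given demand. */
    public int slotsToRequest(int requiredSlots) {
        return Math.max(0, requiredSlots - optimisticFreeSlots());
    }
}
```

In this sketch, a recovered worker that turns out to be still running tasks (it registers with zero free slots) lowers the optimistic count, so a later call to `slotsToRequest` asks for the shortfall. This mirrors the extra bookkeeping the comment weighs against the benefit.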