[ https://issues.apache.org/jira/browse/FLINK-20332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291411#comment-17291411 ]
Xintong Song commented on FLINK-20332: -------------------------------------- I find a problem when trying to manually verify the PR for this feature. For pending resources, an important assumption is that each pending slot / worker corresponds to an upcoming UNOCCUPIED slot / worker. Without this assumption, it would be complicated to handle the cases that a pending resource is assigned to a pending request while the registered resource turns out to be occupied. Based on this assumption, currently in all slot manager implementations, only unoccupied registered resources will be matched with pending resources. The assumption is guaranteed true for pending resources newly requested by RM. It is not always true for the resources recovered from previous attempt, especially when JM/RM failover is fast enough. # When JobMaster & TM are disconnected, all tasks of the disconnected job are failed. However, the slots are not freed immediately. Instead, they go through an inactive timeout (default 10s), during which if the JobMaster is re-connected (could be another instance as long as it has the same job id) they will be offered again. That means TM may offer the slots allocated in previous attempt directly to the JM, even without being connected to the new RM. # If the new leader of RM is elected earlier than JM, it could happen that a TM registers to the new RM before being aware of the failure of the old JM. In both cases, TM may contain occupied slots when register to the new RM, thus will not be matched to the pending resource that represents the recovered worker. This is fatal, because the pending resource can be used for fulfilling future requests, while no actual resource is expected to register. For 1), a possible workaround it to make TM release slots immediately on JM disconnected. However, this forces JM to always go through RM for requesting new slots, which will be a regression for cases where JM can be reconnected in 10s. For 2), I don't see any good way to make sure the recovered workers are unoccupied. To sum up, I'm leaning towards giving up on this improvement and relying on FLINK-18229 for canceling additional resource requests. [~trohrmann], WDYT? > Add workers recovered from previous attempt to pending resources > ---------------------------------------------------------------- > > Key: FLINK-20332 > URL: https://issues.apache.org/jira/browse/FLINK-20332 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination > Reporter: Xintong Song > Assignee: Xintong Song > Priority: Major > Labels: pull-request-available > > For active deployments (Native K8s/Yarn/Mesos), after a JM failover, workers > from previous attempt should register to the new JM. Depending on the order > that slot requests and TM registrations arrive at the RM, it could happen > that RM allocates unnecessary new resources while there are recovered > resources that can be reused. > A potential improvement is to add recovered workers to pending resources, so > that RM knows what resources are expected to be available soon and decide > whether to allocate new resources accordingly. > See also the discussion in FLINK-20249. -- This message was sent by Atlassian Jira (v8.3.4#803005)