[ 
https://issues.apache.org/jira/browse/FLINK-20332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291411#comment-17291411
 ] 

Xintong Song commented on FLINK-20332:
--------------------------------------

I find a problem when trying to manually verify the PR for this feature.

For pending resources, an important assumption is that each pending slot / 
worker corresponds to an upcoming UNOCCUPIED slot / worker. Without this 
assumption, it would be complicated to handle the cases that a pending resource 
is assigned to a pending request while the registered resource turns out to be 
occupied. Based on this assumption, currently in all slot manager 
implementations, only unoccupied registered resources will be matched with 
pending resources.

The assumption is guaranteed true for pending resources newly requested by RM. 
It is not always true for the resources recovered from previous attempt, 
especially when JM/RM failover is fast enough.
# When JobMaster & TM are disconnected, all tasks of the disconnected job are 
failed. However, the slots are not freed immediately. Instead, they go through 
an inactive timeout (default 10s), during which if the JobMaster is 
re-connected (could be another instance as long as it has the same job id) they 
will be offered again. That means TM may offer the slots allocated in previous 
attempt directly to the JM, even without being connected to the new RM.
# If the new leader of RM is elected earlier than JM, it could happen that a TM 
registers to the new RM before being aware of the failure of the old JM. 

In both cases, TM may contain occupied slots when register to the new RM, thus 
will not be matched to the pending resource that represents the recovered 
worker. This is fatal, because the pending resource can be used for fulfilling 
future requests, while no actual resource is expected to register.

For 1), a possible workaround it to make TM release slots immediately on JM 
disconnected. However, this forces JM to always go through RM for requesting 
new slots, which will be a regression for cases where JM can be reconnected in 
10s. For 2), I don't see any good way to make sure the recovered workers are 
unoccupied.

To sum up, I'm leaning towards giving up on this improvement and relying on 
FLINK-18229 for canceling additional resource requests.

[~trohrmann], WDYT?

> Add workers recovered from previous attempt to pending resources
> ----------------------------------------------------------------
>
>                 Key: FLINK-20332
>                 URL: https://issues.apache.org/jira/browse/FLINK-20332
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: Xintong Song
>            Assignee: Xintong Song
>            Priority: Major
>              Labels: pull-request-available
>
> For active deployments (Native K8s/Yarn/Mesos), after a JM failover, workers 
> from previous attempt should register to the new JM. Depending on the order 
> that slot requests and TM registrations arrive at the RM, it could happen 
> that RM allocates unnecessary new resources while there are recovered 
> resources that can be reused.
> A potential improvement is to add recovered workers to pending resources, so 
> that RM knows what resources are expected to be available soon and decide 
> whether to allocate new resources accordingly.
> See also the discussion in FLINK-20249.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to