[jira] [Commented] (FLINK-20332) Add workers recovered from previous attempt to pending resources

Till Rohrmann (Jira) Fri, 26 Feb 2021 01:50:04 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-20332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291521#comment-17291521
 ]


Till Rohrmann commented on FLINK-20332:
---------------------------------------

I think you are right [~xintongsong]. We cannot assume that the recovered 
workers are empty. If we want to support graceful JM failover in the future, 
the TMs must keep their tasks running for a little bit. Hence, stopping them 
directly on JM disconnection might not be possible.

The only way I could see it work atm is to have a second pool of "pending" 
slots which might or might not be free. If they are not free, then the system 
would require to request more resources. But I am not sure whether the benefits 
are worth the extra efforts. Hence, +1 for not pursuing ticket this for the 
time being (and closing it as won't do).

> Add workers recovered from previous attempt to pending resources
> ----------------------------------------------------------------
>
>                 Key: FLINK-20332
>                 URL: https://issues.apache.org/jira/browse/FLINK-20332
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: Xintong Song
>            Assignee: Xintong Song
>            Priority: Major
>              Labels: pull-request-available
>
> For active deployments (Native K8s/Yarn/Mesos), after a JM failover, workers 
> from previous attempt should register to the new JM. Depending on the order 
> that slot requests and TM registrations arrive at the RM, it could happen 
> that RM allocates unnecessary new resources while there are recovered 
> resources that can be reused.
> A potential improvement is to add recovered workers to pending resources, so 
> that RM knows what resources are expected to be available soon and decide 
> whether to allocate new resources accordingly.
> See also the discussion in FLINK-20249.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-20332) Add workers recovered from previous attempt to pending resources

Reply via email to