[jira] [Commented] (FLINK-25893) ResourceManagerServiceImpl's lifecycle can lead to exceptions

Xintong Song (Jira) Sun, 06 Feb 2022 19:33:07 -0800


    [ https://issues.apache.org/jira/browse/FLINK-25893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487853#comment-17487853 ]


Xintong Song commented on FLINK-25893:
--------------------------------------

I also recall a similar discussion, I just can't remember whom I had it 
with. :P

I'm leaning towards option 2).

I'm afraid your option 1) would not work. 
- Any component that is responsible for deregistering would need a leader 
election. Otherwise, we may accidentally deregister the application from a 
non-leading master process while another master process is leading. Thus, 
we would still face the same problem: deregistering may be called while no 
leader is elected.
- It exposes the Kubernetes/Yarn client to either {{ClusterEntrypoint}} or 
{{Dispatcher}}, which complicates the system.

For option 2), changing the contract means the process will exit with code 0 
when there's no leading RM. That should not affect the native Kubernetes & Yarn 
deployments (neither of them relies on the exit code for process restarting), 
but will help the standalone Kubernetes deployment (which does nothing on 
deregistering and relies on the exit code for restarting).

If we have consensus, I can work on this ticket and make the following changes:
- Make {{ResourceManagerGateway#deregisterApplication}} wait until the leading 
RM has fully started.
- Change the contract to not deregister the application if there's no leading 
RM in the process.
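
To illustrate the two changes, here is a minimal sketch. It is not the actual 
Flink code: the class {{DeregisterSketch}}, the methods {{grantLeadership}}, 
{{markStarted}} and {{isDeregistered}}, and the {{leaderStarted}} field are 
hypothetical stand-ins, and the real implementation would go through the 
{{ResourceManagerGateway}} rather than flip a flag.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical, simplified stand-in for the service; not Flink's real class. */
class DeregisterSketch {
    private final Object lock = new Object();

    // Assumption: non-null only while this process hosts the leading RM.
    // The future completes once that RM has been fully started.
    private CompletableFuture<Void> leaderStarted = null;

    // Stands in for the actual deregistration against Kubernetes/Yarn.
    private volatile boolean deregistered = false;

    /** Leadership granted: the RM is created but not yet started. */
    void grantLeadership() {
        synchronized (lock) {
            leaderStarted = new CompletableFuture<>();
        }
    }

    /** The leading RM has finished starting. */
    void markStarted() {
        CompletableFuture<Void> started;
        synchronized (lock) {
            started = leaderStarted;
        }
        if (started != null) {
            started.complete(null);
        }
    }

    CompletableFuture<Void> deregisterApplication() {
        synchronized (lock) {
            if (leaderStarted == null) {
                // New contract: no leading RM in this process, so do not
                // deregister and complete normally instead of exceptionally.
                return CompletableFuture.completedFuture(null);
            }
            // Wait for the leading RM to be fully started, then deregister.
            return leaderStarted.thenRun(() -> deregistered = true);
        }
    }

    boolean isDeregistered() {
        return deregistered;
    }
}
```

With this shape, a call that arrives between leadership being granted and the 
RM being started is deferred rather than discarded, and a call on a process 
without a leading RM completes normally without deregistering anything.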

> ResourceManagerServiceImpl's lifecycle can lead to exceptions
> -------------------------------------------------------------
>
>                 Key: FLINK-25893
>                 URL: https://issues.apache.org/jira/browse/FLINK-25893
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.14.3
>            Reporter: Till Rohrmann
>            Priority: Critical
>              Labels: pull-request-available
>
> The {{ResourceManagerServiceImpl}} lifecycle can lead to exceptions when 
> calling {{ResourceManagerServiceImpl.deregisterApplication}}. The problem 
> arises when the {{DispatcherResourceManagerComponent}} is shut down before the 
> {{ResourceManagerServiceImpl}} gains leadership or while it is starting the 
> {{ResourceManager}}.
> One problem is that {{deregisterApplication}} returns an exceptionally 
> completed future if there is no leading {{ResourceManager}}.
> Another problem is that if there is a leading {{ResourceManager}}, then it 
> can still be the case that it has not been started yet. If this is the case, 
> then 
> [ResourceManagerGateway.deregisterApplication|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManagerServiceImpl.java#L143]
>  will be discarded. The reason for this behaviour is that we create a 
> {{ResourceManager}} in one {{Runnable}} and only start it in another. As a 
> result, a {{deregisterApplication}} call can acquire the {{lock}} in 
> between.
> I'd suggest correcting the lifecycle and contract of 
> {{ResourceManagerServiceImpl.deregisterApplication}}.
> Please note that due to this problem, the error reporting of this method has 
> been suppressed. See FLINK-25885 for more details.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
