[ https://issues.apache.org/jira/browse/FLINK-25893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488033#comment-17488033 ]
Till Rohrmann commented on FLINK-25893:
---------------------------------------

For option 1), I think a leading {{Dispatcher}} would decide that it is now time to shut down and to deregister the application. Then the {{ClusterEntrypoint}} would get the signal and initiate the deregistration.

For option 2): Assuming that eventually a {{Dispatcher}} becomes leader that is running in the same process as the leading {{RM}}, which then triggers the shutdown, I think this can work. Moreover, with FLINK-24038 the problem of a leading RM and {{Dispatcher}} running in different processes should no longer happen.

> ResourceManagerServiceImpl's lifecycle can lead to exceptions
> -------------------------------------------------------------
>
>                 Key: FLINK-25893
>                 URL: https://issues.apache.org/jira/browse/FLINK-25893
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.14.3
>            Reporter: Till Rohrmann
>            Assignee: Xintong Song
>            Priority: Critical
>              Labels: pull-request-available
>
> The {{ResourceManagerServiceImpl}} lifecycle can lead to exceptions when
> calling {{ResourceManagerServiceImpl.deregisterApplication}}. The problem
> arises when the {{DispatcherResourceManagerComponent}} is shut down before the
> {{ResourceManagerServiceImpl}} gains leadership or while it is starting the
> {{ResourceManager}}.
> One problem is that {{deregisterApplication}} returns an exceptionally
> completed future if there is no leading {{ResourceManager}}.
> Another problem is that if there is a leading {{ResourceManager}}, then it
> can still be the case that it has not been started yet. If this is the case,
> then
> [ResourceManagerGateway.deregisterApplication|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManagerServiceImpl.java#L143]
> will be discarded. The reason for this behaviour is that we create a
> {{ResourceManager}} in one {{Runnable}} and only start it in another. Because
> of this, a {{deregisterApplication}} call can acquire the {{lock}} in between
> the two.
> I'd suggest correcting the lifecycle and contract of
> {{ResourceManagerServiceImpl.deregisterApplication}}.
> Please note that due to this problem, the error reporting of this method has
> been suppressed. See FLINK-25885 for more details.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
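
For illustration, the three states described in the issue (no leader yet, leader created but not started, leader started) can be reproduced with a small self-contained sketch. This is not the Flink code: {{FakeResourceManager}}, {{FakeService}}, {{createLeader}} and {{startLeader}} are hypothetical stand-ins, and a "discarded" call is modelled here, as an assumption, by a future that never completes.

```java
import java.util.concurrent.CompletableFuture;

public class LifecycleSketch {

    /** Hypothetical stand-in for a ResourceManager; not the Flink class. */
    static final class FakeResourceManager {
        private boolean started;

        void start() {
            started = true;
        }

        /** Assumption: a call arriving before start() is dropped, so its future never completes. */
        CompletableFuture<Void> deregisterApplication() {
            return started ? CompletableFuture.completedFuture(null) : new CompletableFuture<>();
        }
    }

    /** Hypothetical stand-in for the service that owns the leading ResourceManager. */
    static final class FakeService {
        private final Object lock = new Object();
        private FakeResourceManager leader; // guarded by lock

        /** Step 1: create the leader (the first Runnable in the issue description). */
        void createLeader() {
            synchronized (lock) {
                leader = new FakeResourceManager();
            }
        }

        /** Step 2: start the leader (the second Runnable). The lock is released in between. */
        void startLeader() {
            synchronized (lock) {
                leader.start();
            }
        }

        CompletableFuture<Void> deregisterApplication() {
            synchronized (lock) {
                if (leader == null) {
                    // First problem: no leading ResourceManager yields an exceptional future.
                    CompletableFuture<Void> failed = new CompletableFuture<>();
                    failed.completeExceptionally(
                            new IllegalStateException("no leading ResourceManager"));
                    return failed;
                }
                // Second problem: the leader may exist but not be started yet.
                return leader.deregisterApplication();
            }
        }
    }

    public static void main(String[] args) {
        FakeService service = new FakeService();

        // Before leadership is gained: the future completes exceptionally.
        System.out.println(service.deregisterApplication().isCompletedExceptionally()); // true

        service.createLeader();
        // Between create and start: the call is dropped and never completes.
        System.out.println(service.deregisterApplication().isDone()); // false

        service.startLeader();
        // After start: the call is handled.
        System.out.println(service.deregisterApplication().isDone()); // true
    }
}
```

In this model, one way to close the gap would be to create and start the leader while holding the lock once, and to complete the future normally when there is nothing to deregister. Whether that matches the fix chosen for FLINK-25893 is not stated in this thread.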