[ https://issues.apache.org/jira/browse/FLINK-25893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488063#comment-17488063 ]
Xintong Song commented on FLINK-25893:
--------------------------------------

For option 1): Do we have a guarantee that the deregistration only happens when there's a leading {{Dispatcher}}? I think it can also happen elsewhere, e.g., in the catch block of {{ClusterEntrypoint#startCluster}}. This may not be a perfect example, as the process will exit with a non-zero exit code anyway. My point is that we may still have to deal with the problem when shutting down the cluster without a leading {{Dispatcher}}.

> ResourceManagerServiceImpl's lifecycle can lead to exceptions
> -------------------------------------------------------------
>
>                 Key: FLINK-25893
>                 URL: https://issues.apache.org/jira/browse/FLINK-25893
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.14.3
>            Reporter: Till Rohrmann
>            Assignee: Xintong Song
>            Priority: Critical
>              Labels: pull-request-available
>
> The {{ResourceManagerServiceImpl}} lifecycle can lead to exceptions when
> calling {{ResourceManagerServiceImpl.deregisterApplication}}. The problem
> arises when the {{DispatcherResourceManagerComponent}} is shut down before the
> {{ResourceManagerServiceImpl}} gains leadership or while it is starting the
> {{ResourceManager}}.
> One problem is that {{deregisterApplication}} returns an exceptionally
> completed future if there is no leading {{ResourceManager}}.
> Another problem is that if there is a leading {{ResourceManager}}, then it
> can still be the case that it has not been started yet. If this is the case,
> then
> [ResourceManagerGateway.deregisterApplication|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManagerServiceImpl.java#L143]
> will be discarded. The reason for this behaviour is that we create a
> {{ResourceManager}} in one {{Runnable}} and only start it in another. Because
> of this, a {{deregisterApplication}} call can acquire the {{lock}} in between
> the two.
> I'd suggest correcting the lifecycle and contract of
> {{ResourceManagerServiceImpl.deregisterApplication}}.
> Please note that due to this problem, the error reporting of this method has
> been suppressed. See FLINK-25885 for more details.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
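
To make the race from the issue description concrete, here is a minimal Java sketch. It is a toy model, not Flink code: {{LifecycleSketch}}, {{ToyResourceManager}}, {{ToyService}}, and all method names are hypothetical, the interleaving is simulated sequentially on one thread, and reporting a dropped call as a failed future is an assumption made for illustration. The guarded variant shows one possible contract (succeed when there is no leader, otherwise wait for the start to finish); it is not necessarily the fix adopted for this ticket.

```java
import java.util.concurrent.CompletableFuture;

/** Toy model of the lifecycle race described in the issue; these are not Flink's classes. */
public class LifecycleSketch {

    /** Hypothetical stand-in for a ResourceManager endpoint. */
    static final class ToyResourceManager {
        private boolean started;
        private boolean deregistered;

        void start() { started = true; }

        boolean isDeregistered() { return deregistered; }

        /** Assumption: a call arriving before start() is dropped and reported as a failure. */
        CompletableFuture<Void> deregisterApplication() {
            if (!started) {
                return failed("ResourceManager not started, call dropped");
            }
            deregistered = true;
            return CompletableFuture.completedFuture(null);
        }
    }

    /** Hypothetical stand-in for the service that creates and starts the leader in two steps. */
    static final class ToyService {
        private final Object lock = new Object();
        private final CompletableFuture<Void> started = new CompletableFuture<>();
        private ToyResourceManager leader; // null until leadership is granted

        // Step 1 (first Runnable in the issue): create the leader, but do not start it yet.
        void createLeader() {
            synchronized (lock) { leader = new ToyResourceManager(); }
        }

        // Step 2 (second Runnable): start the leader and signal that it is ready.
        void startLeader() {
            synchronized (lock) { leader.start(); }
            started.complete(null);
        }

        boolean isDeregistered() {
            synchronized (lock) { return leader != null && leader.isDeregistered(); }
        }

        // Problematic contract: fails without a leader, and is dropped before start.
        CompletableFuture<Void> deregisterUnguarded() {
            synchronized (lock) {
                if (leader == null) {
                    return failed("no leading ResourceManager");
                }
                return leader.deregisterApplication();
            }
        }

        // One possible contract: nothing to do without a leader; otherwise defer until started.
        CompletableFuture<Void> deregisterGuarded() {
            synchronized (lock) {
                if (leader == null) {
                    return CompletableFuture.completedFuture(null);
                }
                ToyResourceManager rm = leader;
                return started.thenCompose(ignored -> rm.deregisterApplication());
            }
        }
    }

    static CompletableFuture<Void> failed(String message) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        future.completeExceptionally(new IllegalStateException(message));
        return future;
    }

    public static void main(String[] args) {
        ToyService racy = new ToyService();
        racy.createLeader();
        CompletableFuture<Void> dropped = racy.deregisterUnguarded(); // lands between create and start
        racy.startLeader();
        System.out.println("unguarded: failed=" + dropped.isCompletedExceptionally()
                + ", deregistered=" + racy.isDeregistered());

        ToyService safe = new ToyService();
        safe.createLeader();
        CompletableFuture<Void> deferred = safe.deregisterGuarded(); // same interleaving
        safe.startLeader();
        System.out.println("guarded: failed=" + deferred.isCompletedExceptionally()
                + ", deregistered=" + safe.isDeregistered());
    }
}
```

The sketch does not answer the question of which component triggers the deregistration. It only illustrates that, if the call can arrive without a leading {{Dispatcher}} or {{ResourceManager}}, the contract of {{deregisterApplication}} has to define what happens in that state.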