XComp commented on PR #21137: URL: https://github.com/apache/flink/pull/21137#issuecomment-1300806942
> To me the issue stems more from both the runner and election service calling into each other under locks (== fundamental issue that should never happen), and locks maybe being way too broad. For example, why is Runner#closeAsync doing the entire shutdown under the lock? Modifying the state should suffice, because all other operations are checking that it's running.

As far as we could tell, the problem appears if the `CompletableFuture` returned by `JobMasterServiceProcess#closeAsync` (see [JobMasterServiceLeadershipRunner.java:145](https://github.com/apache/flink/blob/113299701cc0c41bf7fc4bbe86cebd3beea8dbe3/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMasterServiceLeadershipRunner.java#L145)) has already completed by the time the follow-up callback (the one that releases the `ClassLoaderLease` and stops the `DefaultLeaderElectionService`) is attached. In that case, the entire callback is executed right away in the calling thread, i.e. inside the synchronized block, which is not what we want.

We could work around that issue by calling `runAfterwardsAsync` instead, which would make sure that the follow-up calls are not executed within the synchronized block. It feels like a similar pattern (i.e. introducing an async execution in a separate thread) to what we have in the PR right now, just in a different location. Does that sound more reasonable?
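
To illustrate the behavior I mean, here is a minimal sketch using only the JDK's `CompletableFuture`. It is not the Flink code: the class name, the method names and the `LOCK` object are hypothetical, and `thenApplyAsync` with an explicit executor only stands in for the idea behind `runAfterwardsAsync`. It shows that a callback chained onto an already completed future runs in the calling thread while the monitor is still held, whereas the async variant runs in a thread that does not hold it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class; none of these names exist in Flink.
class CallbackUnderLockSketch {
    private static final Object LOCK = new Object();

    /** Chains a plain callback inside the synchronized block; returns whether it ran holding LOCK. */
    static boolean syncCallbackRunsUnderLock() {
        // Stand-in for a closeAsync() future that is already complete.
        CompletableFuture<Void> alreadyDone = CompletableFuture.completedFuture(null);
        boolean[] heldLock = new boolean[1];
        synchronized (LOCK) {
            // The future is complete, so the callback runs immediately in this thread.
            alreadyDone.thenRun(() -> heldLock[0] = Thread.holdsLock(LOCK));
        }
        return heldLock[0];
    }

    /** Chains an async callback inside the synchronized block; returns whether it ran holding LOCK. */
    static boolean asyncCallbackRunsUnderLock() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Void> alreadyDone = CompletableFuture.completedFuture(null);
            CompletableFuture<Boolean> result;
            synchronized (LOCK) {
                // The callback is handed to the executor thread, which does not own LOCK.
                result = alreadyDone.thenApplyAsync(ignored -> Thread.holdsLock(LOCK), executor);
            }
            // Wait for the result only after the lock has been released.
            return result.join();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("thenRun callback held the lock: " + syncCallbackRunsUnderLock());
        System.out.println("thenApplyAsync callback held the lock: " + asyncCallbackRunsUnderLock());
    }
}
```

Running `main` should report `true` for the plain callback and `false` for the async one. Note that the sketch waits for the async result only after leaving the synchronized block; blocking on it while still holding the lock would reintroduce the same coupling.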