XComp commented on PR #21137:
URL: https://github.com/apache/flink/pull/21137#issuecomment-1300806942

   > To me the issue stems more from both the runner and election service 
   > calling into each other under locks (== fundamental issue that should never 
   > happen), and locks maybe being way too broad.
   > For example, why is Runner#closeAsync doing the entire shutdown under the 
   > lock? Modifying the state should suffice, because all other operations are 
   > checking that it's running.
   
   As far as we concluded, the problem appears if the `CompletableFuture` 
returned by `JobMasterServiceProcess#closeAsync` (see 
[JobMasterServiceLeadershipRunner.java:145](https://github.com/apache/flink/blob/113299701cc0c41bf7fc4bbe86cebd3beea8dbe3/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMasterServiceLeadershipRunner.java#L145))
 has already completed by the time the follow-up callback (the one that 
releases the `ClassLoaderLease` and stops the `DefaultLeaderElectionService`) 
is chained onto it. In that case, the entire callback is executed right away 
by the calling thread, inside the synchronized block, which is not what we 
want. We could work around that issue by calling `runAfterwardsAsync` instead, 
which would make sure that the follow-up calls are not executed within the 
synchronized block. It feels like a similar pattern (i.e. introducing an async 
execution in a separate thread) to what we have in the PR right now, just in a 
different location. Does that sound more reasonable?
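
   To illustrate the effect, here is a minimal sketch using only JDK classes. It is not the Flink code: the class name, the lock object and both helper methods are made up for the example, and `thenRunAsync` with an explicit executor is used as a stand-in for what `runAfterwardsAsync` would presumably give us. It only shows that a callback chained onto an already completed future runs in the calling thread (and therefore under the lock), while the async variant does not.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; none of these names exist in Flink.
public class CallbackUnderLockSketch {

    private static final Object LOCK = new Object();

    // Chains a callback synchronously onto an already completed future
    // while holding LOCK. Returns true if the callback ran under the lock.
    static boolean syncCallbackHoldsLock() {
        AtomicBoolean heldLock = new AtomicBoolean(false);
        synchronized (LOCK) {
            CompletableFuture<Void> alreadyDone = CompletableFuture.completedFuture(null);
            // The future is complete, so the callback runs immediately
            // in the calling thread, which still owns LOCK.
            alreadyDone.thenRun(() -> heldLock.set(Thread.holdsLock(LOCK)));
        }
        return heldLock.get();
    }

    // Same setup, but the callback is handed to a separate executor.
    // Returns true if the callback ran under the lock.
    static boolean asyncCallbackHoldsLock() {
        AtomicBoolean heldLock = new AtomicBoolean(true);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Void> followUp;
            synchronized (LOCK) {
                CompletableFuture<Void> alreadyDone = CompletableFuture.completedFuture(null);
                // The callback is executed by the executor thread, which
                // does not own LOCK, even though the future is complete.
                followUp = alreadyDone.thenRunAsync(
                        () -> heldLock.set(Thread.holdsLock(LOCK)), executor);
            }
            followUp.join(); // wait for the callback outside the lock
            return heldLock.get();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("sync callback under lock:  " + syncCallbackHoldsLock());
        System.out.println("async callback under lock: " + asyncCallbackHoldsLock());
    }
}
```

   The first call reports `true` and the second `false`, which is the difference we would be relying on when moving the follow-up work out of the synchronized block.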


