edu05 commented on pull request #16487: URL: https://github.com/apache/flink/pull/16487#issuecomment-885344162
Hi @dmvk, while writing the acceptance test I found a couple of things that don't quite make sense to me yet. Could you help, please?

1. The new call to the recover method still runs on the JobMaster's main thread, not outside of it as desired. You can see this by debugging the new IT I added to the PR with a breakpoint inside recover. In the attached screenshot, notice that the call to recover is made from SchedulerUtils (as intended), but that call is in turn made from inside the JobMaster's main thread, not outside.

   ![debug](https://user-images.githubusercontent.com/1392421/126728056-e14f36b4-bd74-4f9e-a4d6-807c98bf6b51.png)

2. Even if the call were made from a separate thread, the first call to recover would only "warm up" during the interval before the second call to recover via CheckpointCoordinator. If the delay between the two calls is shorter than the time the first recover takes to execute, the JobMaster will stall at that point and be unable to serve RPC calls.

Does that make sense?
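To make point 2 concrete, here is a minimal, self-contained sketch of the timing problem. It does not use any Flink classes: `mainThread`, `ioExecutor`, `recover` and `measureRpcDelayMillis` are all hypothetical stand-ins, and it assumes the second call has to wait for the first one to finish. It measures how long a pending "RPC" sits in the main thread's queue while the main thread is blocked on the warm-up.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class MainThreadStallSketch {

    // Hypothetical stand-in for a slow recover() call (e.g. reading checkpoint metadata).
    static void recover(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Starts a warm-up recover() on an IO thread, waits gapMillis, then makes the
     * single-threaded "main thread" block on it. Returns how long (in ms) an RPC
     * submitted right afterwards waited before the main thread could run it.
     */
    static long measureRpcDelayMillis(long recoverMillis, long gapMillis) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        try {
            // First call: warm-up, off the main thread.
            CompletableFuture<Void> warmUp =
                    CompletableFuture.runAsync(() -> recover(recoverMillis), ioExecutor);

            // Delay between the first and the second call.
            recover(gapMillis);

            // Second call: the main thread needs the result and blocks until it is ready.
            mainThread.execute(warmUp::join);

            // An incoming RPC is queued behind the blocked call.
            long submitted = System.nanoTime();
            CompletableFuture<Long> rpc =
                    CompletableFuture.supplyAsync(() -> System.nanoTime() - submitted, mainThread);
            return TimeUnit.NANOSECONDS.toMillis(rpc.join());
        } finally {
            mainThread.shutdownNow();
            ioExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Gap shorter than recover: the RPC waits for the rest of the warm-up.
        System.out.println("short gap, RPC delay ms: " + measureRpcDelayMillis(500, 50));
        // Gap longer than recover: the warm-up is already done, so no noticeable stall.
        System.out.println("long gap, RPC delay ms: " + measureRpcDelayMillis(100, 300));
    }
}
```

In this sketch the RPC delay is roughly the recover time minus the gap whenever the gap is the shorter of the two, which is the stall described above. Moving the first call off the main thread only helps if the second call does not block on it.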