[GitHub] [flink] edu05 commented on pull request #16487: [FLINK-22483][runtime][coordination] Recover checkpoints when JobMaster gains leadership

GitBox Thu, 22 Jul 2021 18:42:12 -0700


edu05 commented on pull request #16487:
URL: https://github.com/apache/flink/pull/16487#issuecomment-885344162



   Hi @dmvk, while writing the acceptance test I found a couple of things that 
don't quite make sense to me yet. Could you help, please?
   
   1. The new call to the recover method still runs in the JobMaster's main 
thread, not outside of it as intended. You can see this by debugging the new 
IT I added to the PR with a breakpoint inside recover. In the attached 
screenshot, note that the call to recover is made from SchedulerUtils (as 
intended), but that call is itself made from inside the JobMaster's main 
thread.
   
![debug](https://user-images.githubusercontent.com/1392421/126728056-e14f36b4-bd74-4f9e-a4d6-807c98bf6b51.png)
   
   2. Even if the call were made from a separate thread, the first call to 
recover would only "warm up" during the interval before the second call to 
recover via CheckpointCoordinator. If that interval is shorter than the time 
the first recover takes to complete, the JobMaster will still stall at that 
point and be unable to handle RPC calls.
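   
   To illustrate point 2, here is a minimal sketch in plain Java. It is not Flink code: `RecoverStallSketch`, `slowRecover`, `mainThreadBusyMillis`, and the two single-threaded executors are hypothetical stand-ins for the recover call, the JobMaster's main thread, and an IO thread. It assumes the second call arrives immediately after the first one starts, and measures how long the main thread is occupied when it waits for the in-flight recover versus when it only registers a callback.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class RecoverStallSketch {

    // Hypothetical stand-in for a slow recover(); it only sleeps.
    static String slowRecover(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "recovered";
    }

    // Returns how long (in ms) the single "main thread" was occupied by the second step.
    static long mainThreadBusyMillis(boolean blockOnMainThread, long recoverMillis)
            throws Exception {
        ExecutorService mainThread =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "main-thread"));
        ExecutorService ioExecutor =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "io-thread"));
        try {
            // First call: recovery starts outside the main thread.
            CompletableFuture<String> recovered =
                    CompletableFuture.supplyAsync(() -> slowRecover(recoverMillis), ioExecutor);
            CompletableFuture<Void> handled = new CompletableFuture<>();

            // Second call: arrives on the main thread before the first one has finished.
            Future<Long> busy = mainThread.submit(() -> {
                long start = System.nanoTime();
                if (blockOnMainThread) {
                    // The main thread waits out the remainder of the recovery,
                    // so it cannot process anything else in the meantime.
                    recovered.join();
                    handled.complete(null);
                } else {
                    // The main thread registers a callback and returns immediately.
                    recovered.thenRunAsync(() -> handled.complete(null), mainThread);
                }
                return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            });

            long busyMillis = busy.get();
            handled.join(); // let the callback run before shutting the executors down
            return busyMillis;
        } finally {
            mainThread.shutdown();
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("blocking:     " + mainThreadBusyMillis(true, 300) + " ms");
        System.out.println("non-blocking: " + mainThreadBusyMillis(false, 300) + " ms");
    }
}
```

   In the blocking variant the main thread stays busy for roughly the remaining recovery time, which is the stall described above. The callback variant keeps it free, but whether that maps onto how CheckpointCoordinator consumes the recovered checkpoints is an open question, not something this sketch shows.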
   
   Does that make sense?
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

