Hi, I am trying to use Flink's checkpointing mechanism to support task manager recovery. I'm running Flink via Beam with filesystem state storage and the following parameters: checkpointingInterval=30000, checkpointingMode=EXACTLY_ONCE.
What I see is that if I kill a task manager pod, it takes Flink about 30 seconds to identify the failure and another 5-6 minutes to restart the jobs. Is there a way to shorten the downtime? What is the expected downtime after a task manager is killed, until the jobs are recovered? Are there any best practices for handling it (e.g. different configuration parameters)? Thanks, Ifat
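For context, I suspect the failure-detection delay is related to the heartbeat settings, which I have left at their defaults. A sketch of the flink-conf.yaml changes I'm considering (values are guesses, not a tested recommendation):

```yaml
# Detect a dead task manager faster (defaults: 10s interval, 50s timeout).
heartbeat.interval: 5000
heartbeat.timeout: 15000

# Restart failed jobs quickly instead of waiting.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 10s
```

Would tuning these be the right direction, or is the 5-6 minute restart dominated by something else (e.g. resource re-allocation or state restore)?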