Hi, I am trying to use Flink's checkpointing mechanism to support task manager recovery. I'm running Flink via Beam with filesystem state storage and the following parameters: checkpointingInterval=30000, checkpointingMode=EXACTLY_ONCE.
What I see is that if I kill a task manager pod, it takes Flink about 30 seconds to identify the failure and another 5-6 minutes to restart the jobs. Is there a way to shorten the downtime? What is the expected downtime after a task manager is killed, until the jobs are recovered? Are there any best practices for handling it (e.g. different configuration parameters)? Thanks, Ifat
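For context, I suspect the failure-detection delay is related to the heartbeat settings, which I have left at their defaults. A sketch of the flink-conf.yaml changes I'm considering (values are guesses, not a tested recommendation):

```yaml
# Detect a dead task manager faster (defaults: 10s interval, 50s timeout).
heartbeat.interval: 5000
heartbeat.timeout: 15000

# Restart failed jobs quickly instead of waiting.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 10s
```

Would tuning these be the right direction, or is the 5-6 minute restart dominated by something else (e.g. resource re-allocation or state restore)?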