How to know if task-local recovery kicked in for some nodes?

Sonam Mandal Thu, 01 Apr 2021 17:18:20 -0700

Hello,

We are experimenting with task-local recovery, and I wanted to know whether 
there is a way to verify that some tasks of the job recovered from local 
state rather than from remote state.


We have currently set this up with 2 Task Managers with 2 slots each, and we 
run a job with parallelism 4. To simulate a failure, we kill one of the Task 
Manager pods (we run on Kubernetes). I want to see whether the local state on 
the surviving Task Manager was used. I do understand that the state for the 
killed Task Manager will need to be fetched from the checkpoint.
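
For reference, a minimal sketch of the settings this setup corresponds to, 
assuming the standard flink-conf.yaml option names. The values only mirror 
the numbers above, and the local state directory is a hypothetical path, not 
our actual one.

```yaml
# Enable task-local recovery (it is disabled by default).
state.backend.local-recovery: true

# Hypothetical directory for the local state copies; assumption for illustration.
taskmanager.state.local.root-dirs: /tmp/flink-local-state

# 2 slots per Task Manager; with 2 Task Managers this gives 4 slots in total.
taskmanager.numberOfTaskSlots: 2

# Job parallelism of 4, so every slot is occupied.
parallelism.default: 4
```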

Also, do you have any suggestions on how to test such failure scenarios in a 
better way?

Thanks,
Sonam
