Hi Sonam,

We have a similar setup. What I have observed is that when the task manager pod gets killed and restarts (i.e. the entire task manager process restarts), local recovery doesn't happen. The task manager restore process downloads the latest completed checkpoint from the remote state handle even when the older local state data is still available on disk. This happens because the allocation IDs of the tasks running on a task manager change with every run: the task manager restart causes a global job failure and restart, and since the local state store is keyed by allocation ID, the new tasks can no longer match the stored local state.
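For reference, this is with local recovery enabled in flink-conf.yaml. A minimal sketch of the relevant settings (the option names are from the Flink docs; the directory value is just an example):

    # enable task-local recovery (disabled by default)
    state.backend.local-recovery: true
    # optional: where each task manager keeps its local state copies
    taskmanager.state.local.root-dirs: /flink/local-state

Without state.backend.local-recovery: true no local copies are written at all, so a restore always has to go to remote storage anyway.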
Local recovery - i.e. the task restore process using locally stored checkpoint data - kicks in when the task manager process stays alive but one of its tasks fails for some other reason (like a timeout from a sink or an external dependency) and the Flink job gets restarted by the job manager.

Please CMIIW.

Dhanesh Arole

On Tue, Apr 6, 2021 at 11:35 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Sonam,
>
> The easiest way to see whether local state has been used for recovery
> is the recovery time. Apart from that, you can also look for "Found
> registered local state for checkpoint {} in subtask ({} - {} - {}" in
> the logs, which is logged at debug level. This indicates that the local
> state is available. However, it does not say whether it is actually
> used. E.g. when doing a rescaling operation we change the assignment of
> key group ranges, which prevents local state from being used. However,
> in case of a recovery the above-mentioned log message should indicate
> that we use local state recovery.
>
> Cheers,
> Till
>
> On Tue, Apr 6, 2021 at 11:31 AM Tzu-Li (Gordon) Tai
> <tzuli...@apache.org> wrote:
>
>> Hi Sonam,
>>
>> Pulling in Till (cc'ed), I believe he would likely be able to help
>> you here.
>>
>> Cheers,
>> Gordon
>>
>> On Fri, Apr 2, 2021 at 8:18 AM Sonam Mandal <soman...@linkedin.com>
>> wrote:
>>
>>> Hello,
>>>
>>> We are experimenting with task-local recovery, and I wanted to know
>>> whether there is a way to validate that some tasks of the job
>>> recovered from the local state rather than the remote state.
>>>
>>> We've currently set this up with 2 Task Managers with 2 slots each,
>>> and we run a job with parallelism 4. To simulate failure, we kill
>>> one of the Task Manager pods (we run on Kubernetes). I want to see
>>> whether the local state of the other Task Manager was used or not.
>>> I do understand that the state for the killed Task Manager will
>>> need to be fetched from the checkpoint.
>>>
>>> Also, do you have any suggestions on how to test such failure
>>> scenarios in a better way?
>>>
>>> Thanks,
>>> Sonam
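P.S. A concrete way to check for the log line Till mentioned: it is logged at DEBUG, so the log level for the task state manager has to be raised first. A minimal sketch, assuming a log4j2-based Flink (1.11+) and assuming the message comes from org.apache.flink.runtime.state.TaskStateManagerImpl (CMIIW on the exact class):

    # log4j.properties shipped to the task managers
    logger.localstate.name = org.apache.flink.runtime.state.TaskStateManagerImpl
    logger.localstate.level = DEBUG

Then simulate the failure and inspect the surviving task manager (the pod names below are placeholders):

    # kill one task manager pod to trigger a failover
    kubectl delete pod flink-taskmanager-0
    # grep the surviving task manager's log for the local-state message
    kubectl logs flink-taskmanager-1 | grep "Found registered local state for checkpoint"

As Till noted, the line only proves that local state was available; that, plus a noticeably faster restore of those subtasks, is the best signal that it was actually used.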