Hi Jessica,

Multiple factors affect the total recovery time. First of all, Flink needs
to detect that something went wrong. In the worst case this happens through
the missing heartbeat of a dead machine. The default heartbeat timeout is
configured to 50 s, but one can tune it.
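
For reference, these knobs live in flink-conf.yaml; the two relevant keys,
shown here with what should be their default values (in milliseconds), look
like this:

  heartbeat.interval: 10000
  heartbeat.timeout: 50000

Lowering the timeout makes failure detection faster, at the price of more
spurious failures on a heavily loaded cluster.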

Next, Flink needs to cancel the running tasks. The time needed for this
operation is mainly influenced by what the user code is doing. In the
normal case, this should be quite fast.
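
If the user code does block cancellation (e.g. a tight loop or a blocking
call that ignores interrupts), Flink eventually escalates; if I recall the
keys correctly, task.cancellation.interval and task.cancellation.timeout in
flink-conf.yaml control how often the task is interrupted again and when
the TaskManager gives up and fails itself.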

After all tasks have been cancelled, Flink will ask the configured restart
strategy how long it should wait before restarting the job. Once this
happens, Flink will restart the job from the last valid checkpoint. If you
have activated local recovery and all previously used machines are still
available, then the recovery should be almost instantaneous. If this is
not the case, then Flink needs to download the checkpoint data from the
persistent storage. The time needed for this mainly depends on the state
size and the network/IO capacity.
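
To make the job-side part concrete, here is a minimal sketch in the
DataStream API; the checkpoint interval, attempt count and delay are
made-up values, and local recovery itself is enabled on the cluster side
via state.backend.local-recovery: true in flink-conf.yaml:

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.concurrent.TimeUnit;

public class RestartStrategyExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 s so there is a recent point to
        // restart from after a failure.
        env.enableCheckpointing(60_000L);

        // Wait 10 s before each restart attempt, give up after 3 attempts.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(
                        3, Time.of(10, TimeUnit.SECONDS)));

        // Placeholder pipeline; your actual job goes here.
        env.fromElements(1, 2, 3).print();
        env.execute("restart-strategy-example");
    }
}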

The size of the checkpoint can depend on the type of checkpoint you
choose. Incremental checkpoints have the benefit that they are usually
faster to create, but they can blow up the effective size of the checkpoint
a bit. This, however, strongly depends on the access pattern and on how
RocksDB compacts the SST files.
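
Just to show where that knob sits programmatically (this assumes the
flink-statebackend-rocksdb dependency is on the classpath, and the
checkpoint URI is a placeholder):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // The second constructor argument enables incremental checkpoints:
        // only the SST files created since the previous checkpoint are
        // uploaded instead of a full snapshot of the state.
        env.setStateBackend(
                new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
        env.enableCheckpointing(60_000L);

        env.fromElements(1, 2, 3).print();
        env.execute("incremental-checkpoint-example");
    }
}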

Once this is done, Flink will start executing the job from the
checkpointed position. Depending on your processing rate, the checkpoint
position and the rate of incoming events, this last part decides how fast
Flink catches up. If p is the processing rate, i the rate of incoming
events and diff the difference between the checkpoint position and the head
of the queue, then it takes diff / (p - i) seconds (assuming p > i) until
Flink has caught up with the head.
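
To put made-up numbers on that: with a processing rate of p = 100,000
events/s, an ingestion rate of i = 60,000 events/s and a gap of
diff = 20,000,000 events between the restored checkpoint position and the
head of the queue, catch-up takes 20,000,000 / (100,000 - 60,000) = 500 s,
i.e. a bit over 8 minutes. If p <= i, the job never catches up, no matter
how fast the rest of the recovery is.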

Cheers,
Till

On Mon, Feb 10, 2020 at 10:30 PM Woods, Jessica Hui <
jessica.wo...@campus.tu-berlin.de> wrote:

> Hi,
>
> I am working with Flink at the moment and am interested in knowing how one
> could estimate the Total Recovery Time for an application after checkpoint
> recovery. What I am specifically interested in is knowing the time needed
> for the recovery of the state + the catch-up phase (since the application's
> source tasks are reset to an earlier input position after recovery, this
> would be the data it processed before the failure and data that accumulated
> while the application was down).
>
> My questions are, What important considerations should I take into account
> to estimate this time and which parts of the codebase would this
> modification involve?
>
> Thanks,
> Jessica
>
>
