Hi Jessica, multiple factors affect the total recovery time. First of all, Flink needs to detect that something went wrong. In the worst case this happens through the missing heartbeat of a died machine. The default heartbeat value is configured to 50s but one can tune it.
Next, Flink needs to cancel the running tasks. The time needed for this operation is mainly influenced what the user code is doing. In the normal case, this should be quite fast. After all tasks have been cancelled, Flink will ask the configured restart strategy how much time it should wait before restarting the job. Once this happens, Flink will restart the job from the last valid checkpoint. If you have activated local recovery and all previously used machines are still available, then the recovery should be almost instantaneously. If this is not the case, then Flink needs to download the checkpoint data from the persistent storage. The time to do this mainly depends on the state size and network/IO capacity. The size of the checkpoint can depend on the type of checkpoint you are choosing. Incremental checkpoints have the benefit that they are usually faster to create but they can blow up the effective size of the checkpoint a bit. This, however, strongly depends on the access pattern and how RocksDB compacts the sst files. Once this is done, then Flink will start executing the job from the checkpointed position. Depending on your process rate, the checkpoint position and the rate of incoming events, this is the last part which decides how fast Flink will catch up. If p is the process rate, i the rate of incoming events and diff the difference between the checkpoint position and the head of the queue, then it takes diff / (p - i) seconds until Flink has caught up with the head. Cheers, Till On Mon, Feb 10, 2020 at 10:30 PM Woods, Jessica Hui < jessica.wo...@campus.tu-berlin.de> wrote: > ??Hi, > > I am working with Flink at the moment and am interested in knowing how one > could estimate the Total Recovery Time for an application after checkpoint > recovery. What I am specifically interested in is knowing the time needed > for the recovery of the state + the catch-up phase (since the application's > source tasks are reset to an earlier input position after recovery, this > would be the data it processed before the failure and data that accumulated > while the application was down). > > My questions are, What important considerations should I take into account > to estimate this time and which parts of the codebase would this > modification involve? > > Thanks, > Jessica > >