Hi Zhinan,

For i): how long the cancellation takes depends on the user code. If the user code is in a blocking operation, Flink needs to wait until it returns before it can move the Task's state to CANCELED.
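For illustration, here is a minimal sketch of a SourceFunction written so that cancellation can take effect quickly: it avoids unbounded blocking calls and re-checks a volatile flag, so the task can leave CANCELING promptly. The helper pollExternalSystemWithTimeout is a hypothetical stand-in for whatever blocking call the user code actually makes.

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class PromptlyCancellableSource implements SourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            // Use a bounded wait instead of an unbounded blocking call, so the loop
            // re-checks the cancellation flag regularly and the task can leave
            // CANCELING quickly. An interrupt during the wait also ends the task.
            String record = pollExternalSystemWithTimeout(100);
            if (record != null) {
                // Emit under the checkpoint lock, as required for SourceFunction.
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(record);
                }
            }
        }
    }

    @Override
    public void cancel() {
        // Called by Flink on cancellation; makes the run() loop exit.
        running = false;
    }

    // Hypothetical stand-in for a bounded blocking call against an external system.
    private String pollExternalSystemWithTimeout(long timeoutMillis) throws InterruptedException {
        Thread.sleep(timeoutMillis);
        return null;
    }
}

If the user code instead blocks indefinitely and ignores interrupts, Flink has no choice but to wait, which is where delays like the 30s you observed can come from.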
For ii): I think your observation is correct. Could you please open a JIRA issue for this problem so that it can be fixed in Flink? Thanks a lot!

For the time to restore checkpoints, it would also be interesting to add a proper metric to Flink. Hence, you could also create a JIRA issue for that.

Cheers,
Till

On Wed, Aug 19, 2020 at 8:43 AM Zhinan Cheng <znch...@cse.cuhk.edu.hk> wrote:

> Hi Yun,
>
> Thanks a lot for your help. It seems hard to measure the checkpoint restore time currently.
> I do monitor the "fullRestarts" metric, and others like "uptime" and "downtime", to observe some information about failure recovery.
>
> Still some confusions:
> i) I found that it takes the jobmanager up to 30s to move the job from status CANCELING to status CANCELED.
> Is there any reason why it takes so long? Can I reduce this time?
> ii) Currently the way "downtime" is calculated is not consistent with the description in the doc: right now the downtime is actually the current timestamp minus the timestamp when the job started.
> But I think the doc clearly intends it to be the current timestamp minus the timestamp when the job failed.
>
> I think I need to measure these times by adding dedicated metrics myself.
>
> Regards,
> Zhinan
>
> On Wed, 19 Aug 2020 at 01:45, Yun Tang <myas...@live.com> wrote:
>
>> Hi Zhinan,
>>
>> For the time to detect the failure, you could refer to the time when 'fullRestarts' increases. That could give you information about the time of the job failure.
>>
>> For the checkpoint recovery time, there are actually two parts:
>>
>> 1. The time to read the checkpoint meta in the JM. There is currently no explicit metric for this duration, as it essentially amounts to reading a roughly 1 MB file remotely from DFS.
>> 2. The time for tasks to restore state. This should be treated as the real time for checkpoint recovery and could even exceed 10 minutes when restoring a savepoint. Unfortunately, this part is also not recorded in metrics at the moment.
>> If you find that a task is in RUNNING state but not consuming any records, it might be stuck restoring a checkpoint/savepoint.
>>
>> Best
>> Yun Tang
>> ------------------------------
>> *From:* Zhinan Cheng <chingchi...@gmail.com>
>> *Sent:* Tuesday, August 18, 2020 11:50
>> *To:* user@flink.apache.org <user@flink.apache.org>
>> *Subject:* Flink checkpoint recovery time
>>
>> Hi all,
>>
>> I am working on measuring the failure recovery time of Flink, and I want to decompose the recovery time into different parts: the time to detect the failure, the time to restart the job, and the time to restore the checkpoint.
>>
>> I found that I can measure the downtime during the failure, the time to restart the job, and some metrics for checkpointing, as shown below.
>>
>> [image: measure.png]
>>
>> Unfortunately, I cannot find any information about the failure detection time and the checkpoint recovery time. Is there any way Flink provides for this? Otherwise, how can I solve this?
>>
>> Thanks a lot for your help.
>>
>> Regards,
>>
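Until such a metric exists in Flink itself, measuring the restore time in user code is one workaround. Below is a minimal, hypothetical sketch that times the state access in initializeState() of a CheckpointedFunction and exposes it as a custom gauge; the class name RestoreTimedFunction and the metric name restoreTimeMs are made up for illustration, and the measurement only covers the state touched in initializeState(), not the whole task restore path.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.util.Collector;

public class RestoreTimedFunction extends RichFlatMapFunction<String, String>
        implements CheckpointedFunction {

    private transient ListState<String> buffered;
    private volatile long restoreTimeMs = -1L;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        long start = System.currentTimeMillis();
        buffered = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("buffered", String.class));
        if (context.isRestored()) {
            // Iterate once over the restored state so the measurement includes loading it.
            for (String ignored : buffered.get()) {
            }
            restoreTimeMs = System.currentTimeMillis() - start;
        }
    }

    @Override
    public void open(Configuration parameters) {
        // Expose the measured restore duration; initializeState() runs before open().
        getRuntimeContext().getMetricGroup().gauge("restoreTimeMs", new Gauge<Long>() {
            @Override
            public Long getValue() {
                return restoreTimeMs;
            }
        });
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // The operator list state registered above is snapshotted automatically.
    }

    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        buffered.add(value);   // keep some state around so there is something to restore
        out.collect(value);
    }
}

The gauge is registered under the operator's metric scope, so it can be read via the REST API or any configured metrics reporter alongside "fullRestarts", "uptime", and "downtime".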