Hi Yun,

Thanks a lot for your help. It seems hard to measure the checkpoint restore
time at the moment.
I do monitor the "fullRestarts" metric and others like "uptime" and
"downtime" to get some information about failure recovery.

I still have a couple of questions:
i) I found that it can take the jobmanager up to 30s to move the job from
status CANCELING to status CANCELED.
     Is there any reason why it takes so long? Can I reduce this time?
ii) Currently the way the "downtime" is calculated is not consistent with
the description in the doc: right now the downtime is actually the current
timestamp minus the timestamp when the job started.
    But I think the doc clearly intends to measure the current
timestamp minus the timestamp when the job failed (see the sketch below).
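
To spell out the two definitions side by side (the class, method, and variable
names here are just mine, for illustration):

public class DowntimeDefinitions {

    // What I currently observe: downtime measured from when the job started.
    static long observedDowntime(long nowMillis, long jobStartMillis) {
        return nowMillis - jobStartMillis;
    }

    // What the doc seems to intend: downtime measured from the last failure.
    static long documentedDowntime(long nowMillis, long lastFailureMillis) {
        return nowMillis - lastFailureMillis;
    }
}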

I think I need to measure these times by adding dedicated metrics myself, for
example along the lines of the sketch below.
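
Something like the following gauge would at least expose how long an operator
takes to see its first record after a (re)start. It is only a coarse,
user-visible proxy (it does not capture the JM-side metadata read or the exact
state-backend restore duration), and the class and metric names are my own
invention:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;

// Reports -1 until the first record is processed after the task (re)starts.
public class TimeToFirstRecordMapper<T> extends RichMapFunction<T, T> {

    private transient long openedAtMillis;
    private volatile long millisToFirstRecord = -1L;

    @Override
    public void open(Configuration parameters) {
        openedAtMillis = System.currentTimeMillis();
        getRuntimeContext().getMetricGroup()
                .gauge("millisToFirstRecord", (Gauge<Long>) () -> millisToFirstRecord);
    }

    @Override
    public T map(T value) {
        if (millisToFirstRecord < 0) {
            millisToFirstRecord = System.currentTimeMillis() - openedAtMillis;
        }
        return value;
    }
}

Watching this gauge next to "fullRestarts" should give a rough idea of how long
it takes for records to flow again after a failure.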

Regards,
Zhinan




On Wed, 19 Aug 2020 at 01:45, Yun Tang <myas...@live.com> wrote:

> Hi Zhinan,
>
> For the time to detect the failure, you could refer to the time when
> 'fullRestarts' increases. That could give you information about when the
> job failed.
>
> For the checkpoint recovery time, there actually exist two parts:
>
>    1. The time to read the checkpoint metadata in the JM. There is
>    currently no explicit metric for this part, but it is usually short, as
>    it is essentially just reading a ~1 MB file remotely from DFS.
>    2. The time for tasks to restore state. This should be treated as the
>    real checkpoint recovery time and can even exceed 10 minutes when
>    restoring from a savepoint. Unfortunately, this part is also not recorded
>    in metrics at the moment.
>    If you find a task in RUNNING state that is not consuming any records,
>    it might be stuck restoring a checkpoint/savepoint.
>
>
> Best
> Yun Tang
> ------------------------------
> *From:* Zhinan Cheng <chingchi...@gmail.com>
> *Sent:* Tuesday, August 18, 2020 11:50
> *To:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* Flink checkpoint recovery time
>
> Hi all,
>
> I am working on measuring the failure recovery time of Flink and I want to
> decompose the recovery time into different parts, say the time to detect
> the failure, the time to restart the job, and the time to
> restore from the checkpoint.
>
> I found that I can measure the downtime during a failure, the time to
> restart the job, and some checkpointing metrics, as shown below.
>
> [image: measure.png]
> Unfortunately, I cannot find any information about the failure detection
> time or the checkpoint recovery time. Does Flink provide any way to obtain
> these? If not, how can I measure them?
>
> Thanks a lot for your help.
>
> Regards,
>
