Re: Flink checkpoint recovery time

2020-08-21 Thread Zhinan Cheng
/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L1390 > > Cheers, > Till > > On Thu, Aug 20, 2020 at 6:17 PM Zhinan Cheng > wrote: > >> Hi Till, >> >> Thanks for the quick reply. >> >> Yes, the job actu

Re: Decompose failure recovery time

2020-08-20 Thread Zhinan Cheng
some extent > (in exchange for a dirty shutdown, without cleaning up the resources). > > Piotrek > > Thu, 20 Aug 2020 at 18:00 Zhinan Cheng wrote: >> >> Hi Piotr, >> >> Thanks a lot for your help. >> Yes, I finally realize that I can only ap

Re: Flink checkpoint recovery time

2020-08-20 Thread Zhinan Cheng
n. > > Cheers, > Till > > On Thu, Aug 20, 2020 at 4:41 PM Zhinan Cheng > wrote: > >> Hi Till, >> >> Sorry for the late reply. >> Attached is the log of jobmanager. >> I notice that while canceling the job, the jobmanager also warns that >> the

Re: Decompose failure recovery time

2020-08-20 Thread Zhinan Cheng
rojects/flink/flink-docs-stable/monitoring/rest_api.html#jobs > [2] https://issues.apache.org/jira/browse/FLINK-17012 > Tue, 18 Aug 2020 at 04:07 Zhinan Cheng wrote: > > > Hi all, > > > > I am working on measuring the failure recovery time of Flink and I > > wan

Re: Flink checkpoint recovery time

2020-08-19 Thread Zhinan Cheng
issue for > it. > > Cheers, > Till > > On Wed, Aug 19, 2020 at 8:43 AM Zhinan Cheng > wrote: > >> Hi Yun, >> >> Thanks a lot for your help. Seems hard to measure the checkpointing >> restore time currently. >> I do monitor the "fullResta

Re: Flink checkpoint recovery time

2020-08-18 Thread Zhinan Cheng
uck in restoring checkpoint/savepoint. > > > Best > Yun Tang > -- > *From:* Zhinan Cheng > *Sent:* Tuesday, August 18, 2020 11:50 > *To:* user@flink.apache.org > *Subject:* Flink checkpoint recovery time > > Hi all, > > I am work

Flink checkpoint recovery time

2020-08-17 Thread Zhinan Cheng
Hi all, I am working on measuring the failure recovery time of Flink and I want to decompose the recovery time into separate parts: the time to detect the failure, the time to restart the job, and the time to restore from the checkpoint. I found that I can measure the downtime during failure