Hi Piotr,

Thanks a lot. I will try your suggestion and see what happens; a rough sketch of the polling script I have in mind is at the bottom of this mail.
Regards,
Zhinan

On Fri, 21 Aug 2020 at 00:40, Piotr Nowojski <pnowoj...@apache.org> wrote:
>
> Hi Zhinan,
>
> It's hard to say, but my guess is that it takes that long for the tasks to respond to cancellation, which consists of a couple of steps. If a task is currently busy processing something, it has to respond to interruption (`java.lang.Thread#interrupt`). If it takes 30 seconds for a task to react to the interruption and clean up its resources, that can cause problems and there is very little that Flink can do.
>
> If you want to debug it further, I would suggest collecting stack traces during cancellation (or even better: profiling the code during cancellation). This would help you answer the question of what the task threads are busy with.
>
> Probably not a solution, but I'm mentioning it just in case: you can shorten the `task.cancellation.timeout` period. By default it's 180s. After that, the whole TaskManager will be killed. If you have spare TaskManagers or you can restart them very quickly, lowering this timeout might help to some extent (in exchange for a dirty shutdown, without cleaning up the resources).
>
> Piotrek
>
> On Thu, 20 Aug 2020 at 18:00, Zhinan Cheng <znch...@cse.cuhk.edu.hk> wrote:
>>
>> Hi Piotr,
>>
>> Thanks a lot for your help.
>> Yes, I finally realize that I can only approximate the time for [1] and [3] and measure [2] by monitoring the uptime and downtime metrics provided by Flink.
>>
>> And now my problem is that I found the time in [2] can be up to 40s, and I wonder why it takes so long to restart the job.
>> The log actually shows that the time to switch all operator instances from CANCELING to CANCELED is around 30s; do you have any ideas about this?
>>
>> Many thanks.
>>
>> Regards,
>> Zhinan
>>
>> On Thu, 20 Aug 2020 at 21:26, Piotr Nowojski <pnowoj...@apache.org> wrote:
>> >
>> > Hi,
>> >
>> > > I want to decompose the recovery time into different parts, say
>> > > (1) the time to detect the failure,
>> > > (2) the time to restart the job,
>> > > (3) and the time to restore the checkpointing.
>> >
>> > 1. Maybe I'm missing something, but as far as I can tell, Flink cannot help you with that. The time to detect the failure would be the time between when the failure occurred and when the JobManager learns about it. If we could reliably measure/check when the first one happened, we could immediately trigger failover. You are interested in this exactly because there is no reliable way to detect the failure immediately. You could approximate it by analysing the logs.
>> >
>> > 2. Maybe there are some metrics that you could use; if not, you can use the REST API [1] to monitor the job status. Again, you could also do it by analysing the logs.
>> >
>> > 3. In the future this might be measurable using the REST API (similar to point 2.), but currently there is no way to do it that way. There is a ticket for that [2]. I think currently the only way to do it is by analysing the logs.
>> >
>> > If you just need to do this once, I would analyse the logs manually. If you want to do it many times or monitor this continuously, I would write some simple script (python?) that mixes checking the REST API for 2. with log analysis.
>> >
>> > Piotrek
>> >
>> > [1] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs
>> > [2] https://issues.apache.org/jira/browse/FLINK-17012
>> >
>> > On Tue, 18 Aug 2020 at 04:07, Zhinan Cheng <znch...@cse.cuhk.edu.hk> wrote:
>> >
>> > > Hi all,
>> > >
>> > > I am working on measuring the failure recovery time of Flink and I want to decompose the recovery time into different parts, say the time to detect the failure, the time to restart the job, and the time to restore the checkpointing.
>> > >
>> > > Unfortunately, I cannot find any information in the Flink docs to solve this. Is there any way that Flink provides for this? Otherwise, how can I solve this?
>> > >
>> > > Thanks a lot for your help.
>> > >
>> > > Regards,
>> > > Juno
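
P.S. Here is the rough sketch of the polling script I mentioned above, for part (2). It is only a sketch using the Python standard library: the REST endpoint address, the job-id placeholder, and the exact layout of the /jobs response are assumptions based on the REST API docs [1] and need to be checked against the actual cluster. The script timestamps every job-status transition, so the window between the job leaving and re-entering RUNNING approximates the restart time, and those timestamps can then be correlated with the logs for parts (1) and (3).

import json
import time
import urllib.request

REST_URL = "http://localhost:8081"  # JobManager REST endpoint (assumed default port)
JOB_ID = "<job-id>"                  # id of the job to watch (placeholder)
POLL_INTERVAL_S = 0.1

def job_status():
    # GET /jobs lists all jobs with their current status,
    # e.g. {"jobs": [{"id": "...", "status": "RUNNING"}]} (assumed layout).
    with urllib.request.urlopen(REST_URL + "/jobs") as resp:
        overview = json.load(resp)
    for job in overview.get("jobs", []):
        if job["id"] == JOB_ID:
            return job["status"]
    return None

last = None
while True:
    status = job_status()
    if status != last:
        # Print a wall-clock timestamp for every status transition; the gap
        # between leaving RUNNING and re-entering RUNNING approximates the
        # job-restart time (part 2).
        print("%.3f %s -> %s" % (time.time(), last, status))
        last = status
    time.sleep(POLL_INTERVAL_S)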