Re: Debug Slowness in Async Checkpointing

2020-04-29 Thread Piotr Nowojski
Hi,

Yes, for example [1]. Most of the points that you mentioned are already visible in the UI and/or via metrics; just take a look at the subtask checkpoint stats.

> when barriers were instrumented at source from checkpoint coordinator

That’s checkpoint trigger time.

> when each down stream task

Re: Debug Slowness in Async Checkpointing

2020-04-25 Thread Chen Q
Just to echo what Lu mentioned, is there documentation where we can find more info on:

* when barriers were instrumented at source from checkpoint coordinator
* when each down stream task observes the first barrier of a checkpoint
* when the list of barriers of a checkpoint arrives at a task
* when snapshot start/complete
*

Re: Debug Slowness in Async Checkpointing

2020-04-24 Thread Congxian Qiu
Hi,

If the bottleneck is the upload part, have you tried uploading the files using multiple threads [1]?

[1] https://issues.apache.org/jira/browse/FLINK-11008

Best,
Congxian

On Fri, Apr 24, 2020 at 12:38 PM, Lu Niu wrote:
> Hi, Robert
>
> Thanks for replying. Yeah. After I added monitoring on the above path, it
> sh
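To illustrate the multi-threaded upload idea mentioned above, here is a minimal sketch in Python. It is not Flink's implementation: `upload_all` and `fake_upload` are hypothetical names, and the sleep stands in for a real S3 transfer. The point is only that independent files can be handed to a fixed-size thread pool so that slow transfers overlap instead of running one after another.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def upload_all(files, upload_one, num_threads=4):
    """Upload every file with a fixed-size thread pool.

    Results come back in the same order as `files`. If any upload
    raises, the exception propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(upload_one, files))


def fake_upload(name):
    """Hypothetical stand-in for one file transfer (assumption, not a real S3 call)."""
    time.sleep(0.05)  # simulate network latency
    return name + ".uploaded"


if __name__ == "__main__":
    sst_files = ["000012.sst", "000013.sst", "000014.sst", "000015.sst"]
    print(upload_all(sst_files, fake_upload, num_threads=4))
```

With one thread the four simulated uploads run back to back; with four threads they overlap, which is the effect the linked ticket is after. Whether this helps in practice depends on where the bottleneck actually is (network bandwidth, S3 throttling, or per-request latency).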

Re: Debug Slowness in Async Checkpointing

2020-04-23 Thread Lu Niu
Hi, Robert

Thanks for replying. Yeah. After I added monitoring on the above path, it shows the slowness did come from uploading files to S3. Right now I am still investigating the issue. At the same time, I am trying PrestoS3FileSystem to check whether that can mitigate the problem.

Best
Lu

On Thu

Re: Debug Slowness in Async Checkpointing

2020-04-23 Thread Robert Metzger
Hi Lu,

were you able to resolve the issue with the slow async checkpoints? I've added Yu Li to this thread. He has more experience with the state backends to decide which monitoring is appropriate for such situations.

Best,
Robert

On Tue, Apr 21, 2020 at 10:50 PM Lu Niu wrote:
> Hi, Robert
>

Re: Debug Slowness in Async Checkpointing

2020-04-21 Thread Lu Niu
Hi, Robert

Thanks for replying. To improve observability, do you think we should expose more metrics in checkpointing? For example, in incremental checkpoints, the time spent on uploading sst files?

https://github.com/apache/flink/blob/5b71c7f2fe36c760924848295a8090898cb10f15/flink-state-backends/
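As a rough illustration of the kind of measurement being asked about, the sketch below wraps a unit of work in a timer and keeps the elapsed milliseconds per named step. It is written in Python for brevity and is not Flink's metrics API: `DurationRecorder`, `measure`, and the step name `sst_upload` are all hypothetical.

```python
import time
from contextlib import contextmanager


class DurationRecorder:
    """Collects elapsed times (in milliseconds) keyed by step name."""

    def __init__(self):
        self.samples_ms = {}

    @contextmanager
    def measure(self, name):
        # Use a monotonic clock so wall-clock adjustments cannot skew the result.
        start = time.monotonic()
        try:
            yield
        finally:
            # Record the sample even if the wrapped step raised.
            elapsed_ms = (time.monotonic() - start) * 1000.0
            self.samples_ms.setdefault(name, []).append(elapsed_ms)


if __name__ == "__main__":
    recorder = DurationRecorder()
    with recorder.measure("sst_upload"):
        time.sleep(0.02)  # placeholder for the real upload call
    print(recorder.samples_ms)
```

Recording the sample in the `finally` branch means failed or retried uploads are timed as well, which is usually what you want when hunting for slowness.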

Re: Debug Slowness in Async Checkpointing

2020-04-17 Thread Robert Metzger
Hi,

did you check the TaskManager logs for retries by the s3a file system during checkpointing? I'm not aware of any metrics in Flink that could be helpful in this situation.

Best,
Robert

On Tue, Apr 14, 2020 at 12:02 AM Lu Niu wrote:
> Hi, Flink users
>
> We notice sometimes async ch