Hi, Robert

Thanks for replying. Yes, after I added monitoring on the above path, it showed that the slowness did come from uploading files to S3. I am still investigating the issue. In the meantime, I am trying PrestoS3FileSystem to check whether it can mitigate the problem.

Best
Lu

On Thu, Apr 23, 2020 at 8:10 AM Robert Metzger <rmetz...@apache.org> wrote:

> Hi Lu,
>
> were you able to resolve the issue with the slow async checkpoints?
>
> I've added Yu Li to this thread. He has more experience with the state
> backends to decide which monitoring is appropriate for such situations.
>
> Best,
> Robert
>
>
> On Tue, Apr 21, 2020 at 10:50 PM Lu Niu <qqib...@gmail.com> wrote:
>
>> Hi, Robert
>>
>> Thanks for replying. To improve observability, do you think we should
>> expose more metrics in checkpointing? For example, in incremental
>> checkpointing, the time spent on uploading sst files?
>> https://github.com/apache/flink/blob/5b71c7f2fe36c760924848295a8090898cb10f15/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L319
>>
>> Best
>> Lu
>>
>>
>> On Fri, Apr 17, 2020 at 11:31 AM Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>>> Hi,
>>> did you check the TaskManager logs if there are retries by the s3a file
>>> system during checkpointing?
>>>
>>> I'm not aware of any metrics in Flink that could be helpful in this
>>> situation.
>>>
>>> Best,
>>> Robert
>>>
>>> On Tue, Apr 14, 2020 at 12:02 AM Lu Niu <qqib...@gmail.com> wrote:
>>>
>>>> Hi, Flink users
>>>>
>>>> We notice that async checkpointing can sometimes be extremely slow,
>>>> leading to checkpoint timeouts. For example, for a state size of around
>>>> 2.5MB, it could take 7~12min in async checkpointing:
>>>>
>>>> [image: Screen Shot 2020-04-09 at 5.04.30 PM.png]
>>>>
>>>> Notice that all the slowness comes from async checkpointing; there is no
>>>> delay in the sync part or in barrier alignment. As we use rocksdb
>>>> incremental checkpointing, I suspect the slowness might be caused by
>>>> uploading the files to s3. However, I am not completely sure, since there
>>>> are other steps in async checkpointing. Does Flink expose fine-grained
>>>> metrics to debug such slowness?
>>>>
>>>> setup: flink 1.9.1, rocksdb incremental state backend,
>>>> S3AHadoopFileSystem
>>>>
>>>> Best
>>>> Lu