Hi Robert,

Thanks for replying. To improve observability, do you think we should expose more metrics for checkpointing? For example, in incremental checkpointing, the time spent uploading SST files:
https://github.com/apache/flink/blob/5b71c7f2fe36c760924848295a8090898cb10f15/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L319
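
To make the idea concrete, here is a minimal sketch of the kind of timing I have in mind. It is plain Java, not actual Flink code: `StepTimer`, `time`, and `getLastDurationMs` are hypothetical names, and the sleep only stands in for the SST upload. The recorded value is what a metrics gauge could report.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper, not part of Flink: records how long one async-phase step took. */
public class StepTimer {
    // Duration of the most recent timed step, in milliseconds (-1 until a step has run).
    private final AtomicLong lastDurationMs = new AtomicLong(-1L);

    /** Runs the step, records its wall-clock duration (even on failure), and returns its result. */
    public <T> T time(Callable<T> step) throws Exception {
        long start = System.nanoTime();
        try {
            return step.call();
        } finally {
            lastDurationMs.set((System.nanoTime() - start) / 1_000_000L);
        }
    }

    /** Value that a metrics gauge could expose. */
    public long getLastDurationMs() {
        return lastDurationMs.get();
    }

    public static void main(String[] args) throws Exception {
        StepTimer uploadTimer = new StepTimer();
        // Stand-in for the SST upload; the sleep simulates a slow write to S3.
        String result = uploadTimer.time(() -> {
            Thread.sleep(50);
            return "uploaded";
        });
        System.out.println(result + " in " + uploadTimer.getLastDurationMs() + " ms");
    }
}
```

Timing each step of the async part separately this way would show whether the upload or some other step accounts for the 7~12 min.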

Best
Lu

On Fri, Apr 17, 2020 at 11:31 AM Robert Metzger <rmetz...@apache.org> wrote:

> Hi,
> did you check the TaskManager logs if there are retries by the s3a file
> system during checkpointing?
>
> I'm not aware of any metrics in Flink that could be helpful in this
> situation.
>
> Best,
> Robert
>
> On Tue, Apr 14, 2020 at 12:02 AM Lu Niu <qqib...@gmail.com> wrote:
>
>> Hi, Flink users
>>
>> We notice sometimes async checkpointing can be extremely slow, leading to
>> checkpoint timeout. For example, For a state size around 2.5MB, it could
>> take 7~12min in async checkpointing:
>>
>> [image: Screen Shot 2020-04-09 at 5.04.30 PM.png]
>>
>> Notice all the slowness comes from async checkpointing, no delay in sync
>> part and barrier assignment. As we use rocksdb incremental checkpointing, I
>> notice the slowness might be caused by uploading the file to s3. However, I
>> am not completely sure since there are other steps in async checkpointing.
>> Does flink expose fine-granular metrics to debug such slowness?
>>
>> setup: flink 1.9.1, rocksdb incremental state backend, S3AHaoopFileSystem
>>
>> Best
>> Lu