Hi Lu, were you able to resolve the issue with the slow async checkpoints?
I've added Yu Li to this thread. He has more experience with the state
backends and can decide which monitoring is appropriate for such situations.

Best,
Robert

On Tue, Apr 21, 2020 at 10:50 PM Lu Niu <qqib...@gmail.com> wrote:

> Hi, Robert
>
> Thanks for replying. To improve observability, do you think we should
> expose more metrics in checkpointing? For example, in incremental
> checkpointing, the time spent on uploading sst files?
> https://github.com/apache/flink/blob/5b71c7f2fe36c760924848295a8090898cb10f15/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L319
>
> Best
> Lu
>
> On Fri, Apr 17, 2020 at 11:31 AM Robert Metzger <rmetz...@apache.org> wrote:
>
>> Hi,
>> did you check the TaskManager logs for retries by the s3a file
>> system during checkpointing?
>>
>> I'm not aware of any metrics in Flink that could be helpful in this
>> situation.
>>
>> Best,
>> Robert
>>
>> On Tue, Apr 14, 2020 at 12:02 AM Lu Niu <qqib...@gmail.com> wrote:
>>
>>> Hi, Flink users
>>>
>>> We notice that async checkpointing can sometimes be extremely slow,
>>> leading to checkpoint timeouts. For example, for a state size of around
>>> 2.5 MB, the async part of checkpointing can take 7-12 minutes:
>>>
>>> [image: Screen Shot 2020-04-09 at 5.04.30 PM.png]
>>>
>>> Note that all the slowness comes from async checkpointing; there is no
>>> delay in the sync part or in barrier alignment. Since we use RocksDB
>>> incremental checkpointing, I suspect the slowness is caused by uploading
>>> files to S3. However, I am not completely sure, since there are other
>>> steps in async checkpointing. Does Flink expose fine-grained metrics to
>>> debug such slowness?
>>>
>>> Setup: Flink 1.9.1, RocksDB incremental state backend, S3AHadoopFileSystem
>>>
>>> Best
>>> Lu
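[Editor's note on Robert's suggestion to check the TaskManager logs for s3a retries: a minimal log4j fragment like the sketch below raises the Hadoop S3A client's logging so per-request activity, including retries, shows up in the TaskManager logs. The logger names are the standard Hadoop s3a and AWS SDK ones; the chosen levels are an assumption to tune as needed, and this assumes the default log4j 1.x setup that Flink 1.9 ships with.]

```properties
# Hedged sketch: raise S3A logging to DEBUG so per-request activity,
# including retries, appears in the TaskManager logs.
# Add to conf/log4j.properties on the TaskManagers; restart to apply.
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
# AWS SDK request-level logging is very chatty; enable only while debugging.
log4j.logger.com.amazonaws.request=DEBUG
```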
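[Editor's note on the metric Lu proposes: the sketch below illustrates the idea of timing each sst-file upload and exposing the worst latency, using only the JDK. It is not Flink's metrics API; the class and method names (`UploadTimer`, `timeUpload`, `maxUploadMillis`) are hypothetical, and a real implementation would report through a Flink `MetricGroup` inside the snapshot strategy linked above.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: times individual (simulated) sst-file uploads and
// exposes the maximum observed latency, the kind of value a per-checkpoint
// metric could report.
public class UploadTimer {
    private final List<Long> uploadNanos = new ArrayList<>();

    /** Runs one upload, records its wall-clock duration, returns nanos elapsed. */
    public long timeUpload(Runnable upload) {
        long start = System.nanoTime();
        upload.run();
        long elapsed = System.nanoTime() - start;
        uploadNanos.add(elapsed);
        return elapsed;
    }

    /** Slowest recorded upload in milliseconds (0 if nothing recorded yet). */
    public long maxUploadMillis() {
        long max = 0;
        for (long nanos : uploadNanos) {
            max = Math.max(max, TimeUnit.NANOSECONDS.toMillis(nanos));
        }
        return max;
    }

    public static void main(String[] args) {
        UploadTimer timer = new UploadTimer();
        // Simulate two file uploads of different durations.
        timer.timeUpload(() -> sleep(50));
        timer.timeUpload(() -> sleep(10));
        System.out.println("max upload ms: " + timer.maxUploadMillis());
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A metric like this would make it immediately visible whether the 7-12 minute async phase is dominated by S3 upload time or by some other step of the snapshot.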