Hi Flink users,

We have noticed that async checkpointing can sometimes be extremely slow, leading to checkpoint timeouts. For example, for a state size of around 2.5 MB, the async part of a checkpoint can take 7 to 12 minutes:
[image: Screen Shot 2020-04-09 at 5.04.30 PM.png]

Note that all of the slowness comes from the async part of the checkpoint; there is no delay in the sync part or in barrier alignment.

Since we use RocksDB incremental checkpointing, I suspect the slowness is caused by uploading files to S3. However, I am not completely sure, since the async phase includes other steps as well. Does Flink expose finer-grained metrics to debug this kind of slowness?

Setup: Flink 1.9.1, RocksDB incremental state backend, S3AHadoopFileSystem

Best,
Lu