Hi, Flink users

We have noticed that async checkpointing can sometimes be extremely slow,
leading to checkpoint timeouts. For example, for a state size of around
2.5 MB, the async part can take 7 to 12 minutes:

[image: Screen Shot 2020-04-09 at 5.04.30 PM.png]

Note that all of the slowness comes from the async part; there is no delay
in the sync part or in barrier alignment. Since we use RocksDB incremental
checkpointing, I suspect the slowness is caused by uploading files to S3.
However, I am not completely sure, since there are other steps in async
checkpointing. Does Flink expose fine-grained metrics for debugging such
slowness?

Setup: Flink 1.9.1, RocksDB incremental state backend, S3AHadoopFileSystem

Best,
Lu