Hi, Robert

Thanks for replying. To improve observability, do you think we should
expose more metrics in checkpointing? For example, in incremental
checkpointing, the time spent uploading SST files?
https://github.com/apache/flink/blob/5b71c7f2fe36c760924848295a8090898cb10f15/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L319
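
To make the idea concrete, here is a minimal sketch of the kind of timing
I have in mind. StepTimer and its methods are hypothetical names used only
for illustration; they are not existing Flink classes, and this is not how
the linked code is written today. The assumption is that the upload step
could be wrapped like this and the stored value reported through a gauge.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical helper (not part of Flink): records how long one step took. */
final class StepTimer {
    // Wall-clock duration of the most recently finished step, in milliseconds.
    private final AtomicLong lastDurationMillis = new AtomicLong(-1L);

    /** Runs the step and stores its duration, even if the step throws. */
    <T> T time(Supplier<T> step) {
        long start = System.nanoTime();
        try {
            return step.get();
        } finally {
            lastDurationMillis.set((System.nanoTime() - start) / 1_000_000L);
        }
    }

    /** Value a metric gauge could report; -1 until a step has finished. */
    long getLastDurationMillis() {
        return lastDurationMillis.get();
    }
}
```

A real upload throws checked exceptions, so an actual change would likely
need a Callable-style variant rather than a Supplier; the sketch only shows
where the measurement would sit.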

Best
Lu


On Fri, Apr 17, 2020 at 11:31 AM Robert Metzger <rmetz...@apache.org> wrote:

> Hi,
> Did you check the TaskManager logs for retries by the s3a file
> system during checkpointing?
>
> I'm not aware of any metrics in Flink that could be helpful in this
> situation.
>
> Best,
> Robert
>
> On Tue, Apr 14, 2020 at 12:02 AM Lu Niu <qqib...@gmail.com> wrote:
>
>> Hi, Flink users
>>
>> We notice that async checkpointing can sometimes be extremely slow, leading
>> to checkpoint timeouts. For example, for a state size of around 2.5MB, the
>> async part of a checkpoint can take 7~12 min:
>>
>> [image: Screen Shot 2020-04-09 at 5.04.30 PM.png]
>>
>> Note that all the slowness comes from the async part; there is no delay in
>> the sync part or in barrier alignment. As we use RocksDB incremental
>> checkpointing, I suspect the slowness is caused by uploading files to S3.
>> However, I am not completely sure, since there are other steps in async
>> checkpointing. Does Flink expose fine-grained metrics to debug such slowness?
>>
>> Setup: Flink 1.9.1, RocksDB incremental state backend, S3AHadoopFileSystem
>>
>> Best
>> Lu
>>
>
