Just to echo what Lu mentioned: is there documentation where we can find more info on

 * when barriers are injected at the sources by the checkpoint coordinator
 * when each downstream task observes the first barrier of a checkpoint
 * when the full list of barriers of a checkpoint has arrived at a task
 * when the snapshot starts/completes
 * when the upload to the remote file system starts/completes
 * when the ack is sent to the checkpoint coordinator

For now, all we see in the Flink UI is a checkpoint timeout because a task couldn't finish in time, which is too limited for further debugging.
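
To make the request concrete, here is a rough sketch of the kind of per-task timeline I have in mind. It is plain Java, not Flink code: the phase names and the CheckpointTimeline class are hypothetical and only mirror the list above.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical phase names mirroring the list above; these are not Flink identifiers.
enum CheckpointPhase {
    BARRIER_INJECTED,
    FIRST_BARRIER_SEEN,
    ALL_BARRIERS_ARRIVED,
    SNAPSHOT_STARTED,
    SNAPSHOT_COMPLETED,
    UPLOAD_STARTED,
    UPLOAD_COMPLETED,
    ACK_SENT
}

// Records one timestamp per phase for a single checkpoint on a single task.
final class CheckpointTimeline {
    private final Map<CheckpointPhase, Long> timestampsMillis =
            new EnumMap<>(CheckpointPhase.class);

    // Store the time (epoch millis) at which a phase was observed.
    void record(CheckpointPhase phase, long timestampMillis) {
        timestampsMillis.put(phase, timestampMillis);
    }

    // Elapsed millis between two recorded phases, or -1 if either is missing.
    long elapsedMillis(CheckpointPhase from, CheckpointPhase to) {
        Long start = timestampsMillis.get(from);
        Long end = timestampsMillis.get(to);
        if (start == null || end == null) {
            return -1L;
        }
        return end - start;
    }
}
```

With something like this exposed per task, elapsedMillis(UPLOAD_STARTED, UPLOAD_COMPLETED) would tell us directly whether the remote upload is where the time goes.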

Chen


On 4/24/20 10:52 PM, Congxian Qiu wrote:
Hi
If the bottleneck is the upload part, have you tried uploading the files using multiple threads [1]?

[1] https://issues.apache.org/jira/browse/FLINK-11008
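
As a rough illustration of the idea (this is not the actual FLINK-11008 code), uploading independent files on a small thread pool looks roughly like this in plain Java. ParallelUploader and uploadOne are hypothetical names; uploadOne stands in for whatever writes a single file to the remote file system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustration only: not the FLINK-11008 implementation.
final class ParallelUploader {
    // Runs uploadOne for every file on a fixed pool of `threads` workers and
    // returns the results in the same order as `files`.
    static <R> List<R> uploadAll(List<String> files, int threads,
                                 Function<String, R> uploadOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (String file : files) {
                tasks.add(() -> uploadOne.apply(file));
            }
            List<R> results = new ArrayList<>();
            // invokeAll blocks until every task has finished.
            for (Future<R> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            // Restore the interrupt flag before reporting the failure.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("upload interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("upload failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this helps depends on where the time is actually spent: more threads only pay off if the files upload independently and a single stream is not already saturating the link.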
Best,
Congxian


On Fri, Apr 24, 2020 at 12:38 PM Lu Niu <[email protected] <mailto:[email protected]>> wrote:

    Hi, Robert

    Thanks for replying. Yes, after I added monitoring on the above
    path, it showed that the slowness did come from uploading files to
    S3. I am still investigating the issue. In the meantime, I am
    trying PrestoS3FileSystem to check whether it can mitigate the
    problem.

    Best
    Lu

    On Thu, Apr 23, 2020 at 8:10 AM Robert Metzger
    <[email protected] <mailto:[email protected]>> wrote:

        Hi Lu,

        Were you able to resolve the issue with the slow async
        checkpoints?

        I've added Yu Li to this thread. He has more experience with
        the state backends and can judge which monitoring is
        appropriate for such situations.

        Best,
        Robert


        On Tue, Apr 21, 2020 at 10:50 PM Lu Niu <[email protected]
        <mailto:[email protected]>> wrote:

            Hi, Robert

            Thanks for replying. To improve observability, do you
            think we should expose more metrics in checkpointing? For
            example, in incremental checkpoints, the time spent on
            uploading SST files?
            
            https://github.com/apache/flink/blob/5b71c7f2fe36c760924848295a8090898cb10f15/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L319

            Best
            Lu


            On Fri, Apr 17, 2020 at 11:31 AM Robert Metzger
            <[email protected] <mailto:[email protected]>> wrote:

                Hi,
                Did you check the TaskManager logs for retries by the
                s3a file system during checkpointing?

                I'm not aware of any metrics in Flink that could be
                helpful in this situation.

                Best,
                Robert

                On Tue, Apr 14, 2020 at 12:02 AM Lu Niu
                <[email protected] <mailto:[email protected]>> wrote:

                    Hi, Flink users

                    We have noticed that async checkpointing can
                    sometimes be extremely slow, leading to checkpoint
                    timeouts. For example, for a state size of around
                    2.5 MB, the async part can take 7 to 12 minutes:

                    Screen Shot 2020-04-09 at 5.04.30 PM.png

                    Note that all the slowness comes from the async
                    part; there is no delay in the sync part or in
                    barrier alignment. Since we use RocksDB incremental
                    checkpointing, I suspect the slowness is caused by
                    uploading files to S3. However, I am not completely
                    sure, since the async part includes other steps as
                    well. Does Flink expose fine-grained metrics for
                    debugging such slowness?

                    Setup: Flink 1.9.1, RocksDB incremental state
                    backend, S3AHadoopFileSystem

                    Best
                    Lu
