Hi,

If the bottleneck is the upload part, have you tried uploading the files with multiple threads [1]?
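
For reference, here is a minimal sketch of what enabling this could look like in flink-conf.yaml. As far as I know, the multi-threaded upload from FLINK-11008 is only available from Flink 1.10 on, so it would need an upgrade from 1.9.1. The value 4 is just an illustrative assumption, not a tuned recommendation.

```yaml
# Sketch only: number of threads (per stateful operator) used to transfer
# RocksDB files during incremental checkpoints. The default is 1.
# Assumption: Flink 1.10 or later; the value 4 is illustrative.
state.backend.rocksdb.checkpoint.transfer.thread.num: 4
```

If the async duration drops noticeably after raising this value, that would be another hint that the S3 upload is the bottleneck.
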
[1] https://issues.apache.org/jira/browse/FLINK-11008

Best,
Congxian

Lu Niu <qqib...@gmail.com> wrote on Fri, Apr 24, 2020 at 12:38 PM:

> Hi, Robert
>
> Thanks for replying. Yeah. After I added monitoring on the above path, it
> shows the slowness did come from uploading files to S3. Right now I am still
> investigating the issue. At the same time, I am trying PrestoS3FileSystem
> to check whether that can mitigate the problem.
>
> Best
> Lu
>
> On Thu, Apr 23, 2020 at 8:10 AM Robert Metzger <rmetz...@apache.org>
> wrote:
>
>> Hi Lu,
>>
>> were you able to resolve the issue with the slow async checkpoints?
>>
>> I've added Yu Li to this thread. He has more experience with the state
>> backends to decide which monitoring is appropriate for such situations.
>>
>> Best,
>> Robert
>>
>>
>> On Tue, Apr 21, 2020 at 10:50 PM Lu Niu <qqib...@gmail.com> wrote:
>>
>>> Hi, Robert
>>>
>>> Thanks for replying. To improve observability, do you think we should
>>> expose more metrics in checkpointing? For example, in incremental
>>> checkpoints, the time spent on uploading sst files?
>>> https://github.com/apache/flink/blob/5b71c7f2fe36c760924848295a8090898cb10f15/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L319
>>>
>>> Best
>>> Lu
>>>
>>>
>>> On Fri, Apr 17, 2020 at 11:31 AM Robert Metzger <rmetz...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>> did you check the TaskManager logs for retries by the s3a file
>>>> system during checkpointing?
>>>>
>>>> I'm not aware of any metrics in Flink that could be helpful in this
>>>> situation.
>>>>
>>>> Best,
>>>> Robert
>>>>
>>>> On Tue, Apr 14, 2020 at 12:02 AM Lu Niu <qqib...@gmail.com> wrote:
>>>>
>>>>> Hi, Flink users
>>>>>
>>>>> We notice that async checkpointing can sometimes be extremely slow,
>>>>> leading to checkpoint timeouts. For example, for a state size of around
>>>>> 2.5 MB, the async part of the checkpoint can take 7~12 min:
>>>>>
>>>>> [image: Screen Shot 2020-04-09 at 5.04.30 PM.png]
>>>>>
>>>>> Notice that all the slowness comes from the async part; there is no
>>>>> delay in the sync part or in barrier alignment. As we use RocksDB
>>>>> incremental checkpointing, I suspect the slowness is caused by uploading
>>>>> the files to S3. However, I am not completely sure, since there are other
>>>>> steps in async checkpointing. Does Flink expose fine-grained metrics to
>>>>> debug such slowness?
>>>>>
>>>>> Setup: Flink 1.9.1, RocksDB incremental state backend,
>>>>> S3AHadoopFileSystem
>>>>>
>>>>> Best
>>>>> Lu
>>>>>