Hi,
If the bottleneck is the upload part, have you tried uploading the files
using multiple threads [1]?

[1] https://issues.apache.org/jira/browse/FLINK-11008
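
To illustrate the idea, here is a minimal, self-contained sketch of uploading a batch of files on a thread pool while timing each upload. It is not Flink code: `ParallelUploadSketch`, `UploadTiming`, `uploadAll` and the `uploader` callback are hypothetical names, and the "upload" is simulated with a sleep instead of a real S3 call. In Flink itself, as far as I recall, the feature from [1] is controlled by `state.backend.rocksdb.checkpoint.transfer.thread.num`; please verify that against the documentation for your version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch, not Flink code: upload a batch of files on a fixed
// thread pool and record how long each individual upload takes.
class ParallelUploadSketch {

    /** Result of one upload: the file name and the elapsed wall-clock time. */
    static final class UploadTiming {
        final String fileName;
        final long elapsedNanos;

        UploadTiming(String fileName, long elapsedNanos) {
            this.fileName = fileName;
            this.elapsedNanos = elapsedNanos;
        }
    }

    /**
     * Runs the uploader once per file on numThreads threads and returns the
     * timings in the same order as fileNames. The uploader callback stands in
     * for the real call to the remote file system (an assumption of this sketch).
     */
    static List<UploadTiming> uploadAll(
            List<String> fileNames, int numThreads, Consumer<String> uploader)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Callable<UploadTiming>> tasks = new ArrayList<>();
            for (String name : fileNames) {
                tasks.add(() -> {
                    long start = System.nanoTime();
                    uploader.accept(name);
                    return new UploadTiming(name, System.nanoTime() - start);
                });
            }
            List<UploadTiming> timings = new ArrayList<>();
            // invokeAll blocks until every task is done and keeps task order.
            for (Future<UploadTiming> future : pool.invokeAll(tasks)) {
                timings.add(future.get());
            }
            return timings;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> files = Arrays.asList("000012.sst", "000013.sst", "000014.sst");
        // Simulated upload: sleep for a bit instead of talking to S3.
        Consumer<String> fakeUpload = name -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        for (UploadTiming t : uploadAll(files, 3, fakeUpload)) {
            System.out.println(t.fileName + " took " + t.elapsedNanos / 1_000_000 + " ms");
        }
    }
}
```

Logging the per-file timings like this is also a cheap way to confirm whether a few slow uploads dominate the async phase before changing any settings.
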
Best,
Congxian


On Fri, Apr 24, 2020 at 12:38 PM Lu Niu <qqib...@gmail.com> wrote:

> Hi, Robert
>
> Thanks for replying. Yes, after I added monitoring on the above path, it
> showed that the slowness did come from uploading files to S3. I am still
> investigating the issue. In the meantime, I am trying PrestoS3FileSystem
> to check whether it can mitigate the problem.
>
> Best
> Lu
>
> On Thu, Apr 23, 2020 at 8:10 AM Robert Metzger <rmetz...@apache.org>
> wrote:
>
>> Hi Lu,
>>
>> were you able to resolve the issue with the slow async checkpoints?
>>
>> I've added Yu Li to this thread. He has more experience with the state
>> backends to decide which monitoring is appropriate for such situations.
>>
>> Best,
>> Robert
>>
>>
>> On Tue, Apr 21, 2020 at 10:50 PM Lu Niu <qqib...@gmail.com> wrote:
>>
>>> Hi, Robert
>>>
>>> Thanks for replying. To improve observability, do you think we should
>>> expose more metrics for checkpointing? For example, in incremental
>>> checkpoints, the time spent uploading SST files?
>>> https://github.com/apache/flink/blob/5b71c7f2fe36c760924848295a8090898cb10f15/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L319
>>>
>>> Best
>>> Lu
>>>
>>>
>>> On Fri, Apr 17, 2020 at 11:31 AM Robert Metzger <rmetz...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>> did you check the TaskManager logs for retries by the s3a file
>>>> system during checkpointing?
>>>>
>>>> I'm not aware of any metrics in Flink that could be helpful in this
>>>> situation.
>>>>
>>>> Best,
>>>> Robert
>>>>
>>>> On Tue, Apr 14, 2020 at 12:02 AM Lu Niu <qqib...@gmail.com> wrote:
>>>>
>>>>> Hi, Flink users
>>>>>
>>>>> We have noticed that async checkpointing can sometimes be extremely slow,
>>>>> leading to checkpoint timeouts. For example, for a state size of around
>>>>> 2.5MB, async checkpointing can take 7~12 min:
>>>>>
>>>>> [image: Screen Shot 2020-04-09 at 5.04.30 PM.png]
>>>>>
>>>>> Note that all the slowness comes from async checkpointing; there is no
>>>>> delay in the sync part or in barrier alignment. As we use RocksDB
>>>>> incremental checkpointing, I suspect the slowness is caused by uploading
>>>>> files to S3. However, I am not completely sure, since there are other steps
>>>>> in async checkpointing. Does Flink expose fine-grained metrics to debug such
>>>>> slowness?
>>>>>
>>>>> Setup: Flink 1.9.1, RocksDB incremental state backend,
>>>>> S3AHadoopFileSystem
>>>>>
>>>>> Best
>>>>> Lu
>>>>>
>>>>
