[jira] [Commented] (FLINK-30251) Move the IO with DFS during abort checkpoint to an asynchronous thread.

ming li (Jira) Fri, 09 Dec 2022 01:11:33 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-30251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645200#comment-17645200
 ]


ming li commented on FLINK-30251:
---------------------------------

[~gaoyunhaii] thanks for your reply.

We have tried setting a timeout to limit how long these external IO 
operations can block, but if the timeout is too short, it is easily tripped by 
network fluctuations, and if it is too long, the task processes no data for a 
long time.
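
To make the trade-off concrete, here is a minimal sketch of one way such a 
timeout could be applied. The comment does not say how the timeout was 
implemented, so the class {{BoundedWait}} and the method {{runWithTimeout}} 
are hypothetical names and not Flink APIs.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not a Flink class): bounds how long the caller waits for an IO action. */
final class BoundedWait {

    /** Returns true only if the action completed normally within the timeout. */
    static boolean runWithTimeout(ExecutorService io, Runnable action, long timeoutMs)
            throws InterruptedException {
        Future<?> future = io.submit(action);
        try {
            // The calling thread (e.g. the task's mailbox thread) is blocked here
            // for up to timeoutMs, which is exactly the cost of a long timeout.
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // A short timeout lands here on ordinary network jitter.
            // cancel(true) only interrupts the worker; a hung DFS call may ignore it.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The action itself failed.
            return false;
        }
    }
}
```

Whatever value is chosen for {{timeoutMs}}, the caller still waits for up to 
that long, so the timeout only caps the stall rather than removing it.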

 
{quote}Thus we might change it to a thread pool with a limited maximum number 
of thread and one unbounded Blocking Queue. Also since the thread in this pool 
might be blocked, we might need to use a separate thread pool.
{quote}
I think this is a good idea. Could you assign this ticket to me so that I can 
open a formal PR based on this suggestion?
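
As a rough illustration of the suggestion quoted above, the following sketch 
builds a dedicated pool with a limited maximum number of threads and one 
unbounded blocking queue, and hands the abort cleanup to it. 
{{AsyncAbortCleaner}}, {{submitCleanup}} and {{pendingCleanups}} are 
hypothetical names used only for this example; they are not existing Flink 
classes or methods, and the eventual PR may look quite different.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not a Flink class): runs checkpoint-abort cleanup off the task thread. */
final class AsyncAbortCleaner implements AutoCloseable {

    private final ThreadPoolExecutor pool;

    AsyncAbortCleaner(int maxThreads) {
        // core == max: the pool grows to maxThreads, then further tasks queue.
        // With an unbounded queue, a larger maximumPoolSize would never be used.
        pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,
                60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
        // Let idle threads exit so the pool holds no threads when nothing is aborted.
        pool.allowCoreThreadTimeOut(true);
    }

    /** Returns immediately; the possibly blocking cleanup runs on a pool thread. */
    void submitCleanup(Runnable closeStreamAndDeleteFile) {
        pool.execute(() -> {
            try {
                closeStreamAndDeleteFile.run();
            } catch (RuntimeException e) {
                // Cleanup is best effort here; a failure must not affect the task.
                System.err.println("Checkpoint abort cleanup failed: " + e);
            }
        });
    }

    /** Number of cleanups waiting in the queue (not counting running ones). */
    int pendingCleanups() {
        return pool.getQueue().size();
    }

    @Override
    public void close() {
        // Stop accepting new cleanups; already queued ones still run.
        pool.shutdown();
    }
}
```

Because the pool is separate, threads stuck on a hung DFS call cannot starve 
other asynchronous work. The unbounded queue can grow while the DFS is down, 
so a real implementation would probably want to monitor its size.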

> Move the IO with DFS during abort checkpoint to an asynchronous thread.
> -----------------------------------------------------------------------
>
>                 Key: FLINK-30251
>                 URL: https://issues.apache.org/jira/browse/FLINK-30251
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.16.0, 1.15.2
>            Reporter: ming li
>            Priority: Major
>         Attachments: image-2022-11-30-19-10-51-226.png
>
>
> Currently, when a {{checkpoint}} fails, we process the abort message in the 
> Task's {{mailbox}}: we close the output stream and delete the file on DFS.
>  
> However, when the {{checkpoint}} failure is caused by a DFS failure (for 
> example, an HDFS namenode failure), this operation may take a long time or 
> hang, and the task cannot process any data in the meantime.
>  
> So I think we can move the file deletion to an asynchronous thread, just as 
> checkpoint data is already uploaded asynchronously.
> !image-2022-11-30-19-10-51-226.png|width=731,height=347!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
