[ https://issues.apache.org/jira/browse/FLINK-30251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645200#comment-17645200 ]
ming li commented on FLINK-30251:
---------------------------------

[~gaoyunhaii] thanks for your reply. We have tried setting a timeout to limit the waiting time of these external IO operations, but if the timeout is too short, it is easily affected by the network, and if it is too long, no data is processed for a long time.
{quote}Thus we might change it to a thread pool with a limited maximum number of thread and one unbounded Blocking Queue. Also since the thread in this pool might be blocked, we might need to use a separate thread pool.
{quote}
I think this is a good idea. Can you assign this ticket to me so that I can make a formal PR based on this suggestion?

> Move the IO with DFS during abort checkpoint to an asynchronous thread.
> -----------------------------------------------------------------------
>
>                 Key: FLINK-30251
>                 URL: https://issues.apache.org/jira/browse/FLINK-30251
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.16.0, 1.15.2
>            Reporter: ming li
>            Priority: Major
>         Attachments: image-2022-11-30-19-10-51-226.png
>
>
> Currently, when a {{checkpoint}} fails, we process the abort message in the
> Task's {{mailbox}}. We close the output stream and delete the file
> on DFS.
>
> However, when the {{checkpoint}} failure is caused by a DFS system failure
> (for example, a namenode failure in HDFS), this operation may take a long
> time or hang, and the task cannot process data during that time.
>
> So I think we can move the file deletion to an asynchronous
> thread, just as checkpoint data is uploaded asynchronously.
> !image-2022-11-30-19-10-51-226.png|width=731,height=347!
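
To make the quoted suggestion concrete, here is a minimal sketch of a dedicated cleanup pool with a bounded number of threads and an unbounded blocking queue. It is an illustration only, not Flink's actual implementation: the class name AsyncCheckpointCleanup and the methods discardAsync and shutdownAndAwait are hypothetical, and the local filesystem API (java.nio.file) stands in for the DFS client. One detail worth noting: with an unbounded queue, a ThreadPoolExecutor never grows beyond its core size, so the core size is set equal to the maximum and idle core threads are allowed to time out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs potentially blocking file deletions on a
// separate pool so the caller (e.g. a task's mailbox thread) is not blocked.
class AsyncCheckpointCleanup {
    private final ThreadPoolExecutor executor;

    AsyncCheckpointCleanup(int maxThreads) {
        // core == max, because an unbounded queue means the pool never
        // creates threads beyond the core size.
        this.executor = new ThreadPoolExecutor(
                maxThreads, maxThreads,
                60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
        // Let idle threads exit so the pool shrinks when there is no work.
        this.executor.allowCoreThreadTimeOut(true);
    }

    // Enqueue the deletion and return immediately; a hanging delete only
    // occupies one pool thread, never the calling thread.
    void discardAsync(Path file) {
        executor.execute(() -> {
            try {
                Files.deleteIfExists(file);
            } catch (IOException e) {
                // Cleanup is best-effort in this sketch; just report it.
                System.err.println("Failed to delete " + file + ": " + e);
            }
        });
    }

    // Stop accepting new work, then wait for queued deletions to finish.
    // Returns false if the timeout elapsed first.
    boolean shutdownAndAwait(long timeout, TimeUnit unit) throws InterruptedException {
        executor.shutdown();
        return executor.awaitTermination(timeout, unit);
    }
}
```

Because the queue is unbounded, a long DFS outage would let pending deletions accumulate in memory; whether that is acceptable, or whether a cap or drop policy is needed, is a design question this sketch leaves open.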