[ https://issues.apache.org/jira/browse/FLINK-28984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581190#comment-17581190 ]
ChangjiGuo edited comment on FLINK-28984 at 8/18/22 7:48 AM:
-------------------------------------------------------------

Hi [~yunta], I have verified this on Flink-1.15.1. My fix is to check whether _FsCheckpointStateOutputStream_ has been closed after creating the output stream, and to clean up (closing the stream and deleting the file) if it has. It works well and the above logs no longer appear.
Can you take a look if you have time? Thx.


was (Author: changjiguo):
Hi [~yunta], I have verified this on Flink-1.15.1. My fix is to check whether _FsCheckpointStateOutputStream_ has been closed after creating the output stream, and to clean up if it has. It works well and the above logs no longer appear.
Can you take a look if you have time? Thx.

> FsCheckpointStateOutputStream is not being released normally
> ------------------------------------------------------------
>
>                 Key: FLINK-28984
>                 URL: https://issues.apache.org/jira/browse/FLINK-28984
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.15.1
>            Reporter: ChangjiGuo
>            Priority: Major
>         Attachments: log.png
>
>
> If the checkpoint is aborted, AsyncSnapshotCallable will close the
> snapshotCloseableRegistry when it is canceled. There may be two situations
> here:
> # The FSDataOutputStream has already been created, and it is closed while
> closing FsCheckpointStateOutputStream.
> # The FSDataOutputStream has not been created yet, but the closed flag has
> already been set to true. You can see this in the log:
> {code:java}
> 2022-08-16 12:55:44,161 WARN
> org.apache.flink.core.fs.SafetyNetCloseableRegistry - Closing
> unclosed resource via safety-net:
> ClosingFSDataOutputStream(org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream@4ebe8e64)
> :
> xxxxx/flink/checkpoint/state/9214a2e302904b14baf2dc1aacbc7933/ae157c5a05a8922a46a179cdb4c86b10/shared/9d8a1e92-2f69-4ab0-8ce9-c1beb149229a
> {code}
> The output stream will be automatically closed by the
> SafetyNetCloseableRegistry but the file will not be deleted.
> The second case usually occurs when the storage system has high latency in
> creating files.
> How to reproduce?
> This is not easy to reproduce, but you can try setting a smaller checkpoint
> timeout and increasing the parallelism of the Flink job.
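
The fix described in the comment (re-check the closed flag right after the output stream is created, then close the stream and delete the file if the flag is set) could look roughly like the sketch below. It is a standalone model built on java.nio, not Flink's actual code: the class SketchCheckpointOutputStream, the IoAction hook, and the duringCreate parameter are all hypothetical names introduced only to make the race reproducible in a single thread.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical hook type so the sketch can simulate a slow file creation. */
interface IoAction {
    void run() throws IOException;
}

/** Simplified stand-in for FsCheckpointStateOutputStream; not Flink's actual code. */
class SketchCheckpointOutputStream {
    private final Path file;
    private final Object lock = new Object();
    private boolean closed;    // guarded by lock
    private OutputStream out;  // guarded by lock; null until lazily created

    SketchCheckpointOutputStream(Path file) {
        this.file = file;
    }

    /** What a closeable registry would call when the checkpoint is aborted. */
    void close() throws IOException {
        OutputStream toClose;
        synchronized (lock) {
            closed = true;
            toClose = out;
            out = null;
        }
        if (toClose != null) {
            // Case 1 from the issue: the stream exists, so close it and delete the file.
            toClose.close();
            Files.deleteIfExists(file);
        }
        // Case 2: nothing has been created yet; createStream() must notice the flag.
    }

    /** Lazily creates the file; duringCreate stands in for storage latency. */
    void createStream(IoAction duringCreate) throws IOException {
        synchronized (lock) {
            if (closed) {
                throw new IOException("Stream is already closed.");
            }
        }
        duringCreate.run(); // close() may be called while the create is in flight
        OutputStream created = Files.newOutputStream(file);

        boolean closedMeanwhile;
        synchronized (lock) {
            closedMeanwhile = closed;
            if (!closedMeanwhile) {
                out = created;
            }
        }
        if (closedMeanwhile) {
            // The proposed check: clean up instead of leaking the stream and the file.
            created.close();
            Files.deleteIfExists(file);
            throw new IOException("Stream was closed while the file was being created.");
        }
    }
}
```

Calling createStream(() -> stream.close()) on an instance named stream simulates an abort that arrives while the file is being created. In this sketch the call then fails with an IOException and no file is left behind, which is the behavior the comment reports for the patched version.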