[jira] [Comment Edited] (FLINK-28984) FsCheckpointStateOutputStream is not being released normally

ChangjiGuo (Jira) Thu, 18 Aug 2022 00:50:13 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-28984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581190#comment-17581190
 ]


ChangjiGuo edited comment on FLINK-28984 at 8/18/22 7:48 AM:
-------------------------------------------------------------

Hi [~yunta], I have verified this on Flink 1.15.1.

My fix is to check whether _FsCheckpointStateOutputStream_ has already been 
closed right after the output stream is created, and to clean up (close the 
stream and delete the file) if it has. With this change, the warning above no 
longer appears in the logs.
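
For illustration, the check described above can be sketched in plain Java as 
follows. This is a simplified stand-in, not the actual Flink patch: 
GuardedCheckpointStream, CloseRaceDemo and the afterCreate hook are 
hypothetical names, java.nio.file is used in place of Flink's FileSystem API, 
and the hook only exists to simulate close() arriving while the file is being 
created.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo: close() is called while the file is still being created. */
final class CloseRaceDemo {
    /** Returns true if the state file is still on disk after the simulated race. */
    static boolean leftoverFileAfterRace() {
        try {
            Path dir = Files.createTempDirectory("ckpt-sketch");
            Path file = dir.resolve("state");
            GuardedCheckpointStream[] ref = new GuardedCheckpointStream[1];
            // The hook plays the role of the canceling thread.
            ref[0] = new GuardedCheckpointStream(file, () -> {
                try {
                    ref[0].close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            try {
                ref[0].write(new byte[] {1, 2, 3});
            } catch (IOException expected) {
                // The write is rejected because the stream was closed mid-creation.
            }
            boolean leftover = Files.exists(file);
            Files.deleteIfExists(file);
            Files.deleteIfExists(dir);
            return leftover;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("leftover file: " + leftoverFileAfterRace());
    }
}

/** Simplified model of a checkpoint stream that creates its file lazily. */
final class GuardedCheckpointStream {
    private final Path file;
    private final Runnable afterCreate; // hypothetical hook, for the demo only
    private final Object lock = new Object();
    private volatile boolean closed;
    private OutputStream out; // only assigned by the writing thread

    GuardedCheckpointStream(Path file, Runnable afterCreate) {
        this.file = file;
        this.afterCreate = afterCreate;
    }

    void write(byte[] data) throws IOException {
        if (closed) {
            throw new IOException("stream is closed");
        }
        if (out == null) {
            // On a slow storage system this call can take long enough for close() to run.
            OutputStream created = Files.newOutputStream(file);
            afterCreate.run();
            synchronized (lock) {
                if (closed) {
                    // close() saw out == null and cleaned up nothing, so clean up here.
                    try {
                        created.close();
                    } finally {
                        Files.deleteIfExists(file);
                    }
                    throw new IOException("stream was closed during file creation");
                }
                out = created;
            }
        }
        out.write(data);
    }

    void close() throws IOException {
        synchronized (lock) {
            if (closed) {
                return;
            }
            closed = true;
            if (out != null) {
                try {
                    out.close();
                } finally {
                    Files.deleteIfExists(file);
                }
            }
        }
    }
}
```

Without the re-check of the closed flag after creation, the file created in 
write() would stay on disk in this scenario, which corresponds to the second 
situation in the issue description.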

Could you take a look when you have time? Thanks.


was (Author: changjiguo):
Hi [~yunta], I have verified on Flink-1.15.1.

My fix is to check if _FsCheckpointStateOutputStream_ has been closed after 
creating the output stream and clean up if closed. It works well and without 
the above logs.

Can you take a look if you have time? Thx.

> FsCheckpointStateOutputStream is not being released normally
> ------------------------------------------------------------
>
>                 Key: FLINK-28984
>                 URL: https://issues.apache.org/jira/browse/FLINK-28984
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.15.1
>            Reporter: ChangjiGuo
>            Priority: Major
>         Attachments: log.png
>
>
> If the checkpoint is aborted, AsyncSnapshotCallable closes the 
> snapshotCloseableRegistry when it is canceled. There are two possible 
> situations:
>  # The FSDataOutputStream has already been created, so it is closed when 
> FsCheckpointStateOutputStream is closed.
>  # The FSDataOutputStream has not been created yet, but the closed flag has 
> already been set to true. You can see this in the log:
> {code:java}
> 2022-08-16 12:55:44,161 WARN  org.apache.flink.core.fs.SafetyNetCloseableRegistry           - Closing unclosed resource via safety-net: ClosingFSDataOutputStream(org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream@4ebe8e64) : xxxxx/flink/checkpoint/state/9214a2e302904b14baf2dc1aacbc7933/ae157c5a05a8922a46a179cdb4c86b10/shared/9d8a1e92-2f69-4ab0-8ce9-c1beb149229a
>  {code}
> The output stream will be closed automatically by the 
> SafetyNetCloseableRegistry, but the file will not be deleted.
> The second case usually occurs when the storage system has high latency 
> when creating files.
> How to reproduce?
> This is not easy to reproduce, but you can try setting a smaller checkpoint 
> timeout and increasing the parallelism of the Flink job.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
