[ 
https://issues.apache.org/jira/browse/FLINK-28984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580099#comment-17580099
 ] 

ChangjiGuo edited comment on FLINK-28984 at 8/17/22 3:06 AM:
-------------------------------------------------------------

[~yunta]  I'm sorry, maybe I didn't express it clearly. The prerequisite here 
is that snapshotCloseableRegistry gets closed, which in turn calls close on 
every registered closeable. _FsCheckpointStateOutputStream#createStream_ and 
_FsCheckpointStateOutputStream#close_ can therefore run at the same time, so 
it is possible that the FSDataOutputStream has not been created yet when close 
is called (at that point, outStream is still null).

To verify this conjecture, I added log statements at the point where closed is 
set to true and at the point where the stream is returned. The output is as 
follows:

!log.png|width=741,height=304!
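
To make the interleaving concrete, here is a minimal Java sketch of the case 
where close runs before the stream exists. It is a simplified model and not 
the actual Flink code: CloseRaceSketch, LazyStateStream, FakeFileStream and 
the two latches are hypothetical names introduced only for illustration, and 
the latches merely force the ordering that high file-creation latency would 
produce by chance.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class CloseRaceSketch {

    /** Hypothetical stand-in for the file-backed FSDataOutputStream. */
    static final class FakeFileStream {
        final AtomicBoolean closed = new AtomicBoolean();
        final AtomicBoolean fileDeleted = new AtomicBoolean();
    }

    /** Simplified model of a stream that creates its underlying file lazily. */
    static final class LazyStateStream {
        private volatile boolean closed;
        private volatile FakeFileStream outStream;

        // Test-only latches that pin down the interleaving deterministically.
        final CountDownLatch createStarted = new CountDownLatch(1);
        final CountDownLatch closeFinished = new CountDownLatch(1);

        void createStream() throws InterruptedException {
            createStarted.countDown();
            // Models slow file creation: close() completes while we are "creating".
            closeFinished.await();
            // The file now exists, but close() has already run and will not run again.
            outStream = new FakeFileStream();
        }

        void close() {
            if (!closed) {
                closed = true;
                FakeFileStream s = outStream;
                if (s != null) { // null in the racy case, so nothing is cleaned up
                    s.closed.set(true);
                    s.fileDeleted.set(true);
                }
            }
            closeFinished.countDown();
        }

        boolean isClosed() { return closed; }

        FakeFileStream underlying() { return outStream; }
    }

    /** Runs createStream() and close() in the problematic order. */
    static LazyStateStream runRace() {
        LazyStateStream stream = new LazyStateStream();
        Thread writer = new Thread(() -> {
            try {
                stream.createStream();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
        try {
            stream.createStarted.await(); // creation is in flight
            stream.close();               // models the registry closing the stream
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stream;
    }

    public static void main(String[] args) {
        LazyStateStream s = runRace();
        System.out.println("closed flag set:    " + s.isClosed());
        System.out.println("underlying created: " + (s.underlying() != null));
        System.out.println("underlying closed:  " + s.underlying().closed.get());
        System.out.println("file deleted:       " + s.underlying().fileDeleted.get());
    }
}
```

In this model the closed flag ends up true, yet the underlying stream is 
created afterwards and is neither closed nor deleted by close. The sketch does 
not model the SafetyNetCloseableRegistry, which in the real job closes the 
stream later but leaves the file behind.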
 


was (Author: changjiguo):
[~yunta]  I'm sorry, maybe I didn't express it clearly. The prerequisite here 
is to close the snapshotCloseableRegistry, and the registered closeable will 
call the close method. Both _FsCheckpointStateOutputStream#createStream_ and 
_FsCheckpointStateOutputStream#close_ can be called at the same time. It is 
possible that the FSDataOutputStream has not been created when close is 
called (at this time, outStream is null).

In order to verify this conjecture, I logged both the point where closed is 
set to true and the point where the stream is returned, as follows:

!log.png|width=741,height=304!

> FsCheckpointStateOutputStream is not being released normally
> ------------------------------------------------------------
>
>                 Key: FLINK-28984
>                 URL: https://issues.apache.org/jira/browse/FLINK-28984
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.15.1
>            Reporter: ChangjiGuo
>            Priority: Major
>         Attachments: log.png
>
>
> If the checkpoint is aborted, AsyncSnapshotCallable closes the 
> snapshotCloseableRegistry when it is canceled. Two situations can occur 
> here:
>  # The FSDataOutputStream has already been created, so it is closed when 
> FsCheckpointStateOutputStream is closed.
>  # The FSDataOutputStream has not been created yet, but the closed flag has 
> already been set to true. You can see this in the log:
> {code:java}
> 2022-08-16 12:55:44,161 WARN  
> org.apache.flink.core.fs.SafetyNetCloseableRegistry           - Closing 
> unclosed resource via safety-net: 
> ClosingFSDataOutputStream(org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream@4ebe8e64)
>  : 
> xxxxx/flink/checkpoint/state/9214a2e302904b14baf2dc1aacbc7933/ae157c5a05a8922a46a179cdb4c86b10/shared/9d8a1e92-2f69-4ab0-8ce9-c1beb149229a
>  {code}
> The output stream will be closed automatically by the 
> SafetyNetCloseableRegistry, but the file will not be deleted.
> The second case usually occurs when the storage system has high latency when 
> creating files.
> How to reproduce?
> This is not easy to reproduce, but you can try setting a smaller checkpoint 
> timeout and increasing the parallelism of the Flink job.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
