Jinzhong Li created FLINK-35897:
-----------------------------------

             Summary: Some checkpoint files and localState files can't be 
cleanUp when checkpoint is aborted 
                 Key: FLINK-35897
                 URL: https://issues.apache.org/jira/browse/FLINK-35897
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing, Runtime / State Backends
            Reporter: Jinzhong Li


h2. Problem
When a job's checkpoint is canceled 
([AsyncSnapshotCallable.java#L129|https://github.com/apache/flink/blob/d4294c59e6f2ec8702f53916ea49cf23f6db8961/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L129]),
the asynchronous snapshot thread may still continue executing and produce a 
completed checkpoint 
([RocksIncrementalSnapshotStrategy.java#L324|https://github.com/apache/flink/blob/d4294c59e6f2ec8702f53916ea49cf23f6db8961/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L324]).
In that case, nothing is responsible for cleaning up the completed 
checkpoint: neither the async snapshot thread nor 
SubtaskCheckpointCoordinatorImpl.

 
h3. How to reproduce it 
We can reproduce this issue by running the 
[DataGenWordCount example in my debug branch|https://github.com/ljz2051/flink/commit/33c0c55098a49a0b56c9404256a560da5069f26c],
in which I've added some debug code.
 
h3. How to fix it
When the asynchronous snapshot thread completes a checkpoint, it needs to 
clean up the completed checkpoint if it finds that the checkpoint has been 
canceled.
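
The proposed fix could be sketched roughly as follows. This is a minimal, 
self-contained illustration and not Flink code: the class 
CancelAwareSnapshot, its State enum, and the in-memory fileStore standing in 
for checkpoint/localState files are all hypothetical names. It also assumes 
that a compare-and-set on a shared state is an acceptable way to decide 
which side owns the cleanup; the actual patch may do this differently.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for an async snapshot task; not a Flink class. */
final class CancelAwareSnapshot {
    enum State { RUNNING, COMPLETED, CANCELLED }

    private final AtomicReference<State> state =
            new AtomicReference<>(State.RUNNING);
    // Assumption: a set of names simulates files written by the snapshot.
    private final Set<String> fileStore;
    private final String checkpointFile;

    CancelAwareSnapshot(Set<String> fileStore, long checkpointId) {
        this.fileStore = fileStore;
        this.checkpointFile = "chk-" + checkpointId;
    }

    /**
     * Abort path. Returns true if the snapshot had not completed yet, in
     * which case the async thread will clean up after itself.
     */
    boolean cancel() {
        return state.compareAndSet(State.RUNNING, State.CANCELLED);
    }

    /**
     * Async path. Writes the snapshot, then checks for cancellation.
     * Returns the file name, or null if the checkpoint was canceled and
     * the freshly written file has been removed again.
     */
    String call() {
        fileStore.add(checkpointFile); // simulate writing the snapshot
        if (state.compareAndSet(State.RUNNING, State.COMPLETED)) {
            return checkpointFile; // the consumer of the result owns it now
        }
        // Canceled while running: the async thread discards its own output.
        fileStore.remove(checkpointFile);
        return null;
    }
}
```

In this sketch, a call() that finishes after cancel() leaves nothing behind 
in fileStore. If cancel() returns false, the snapshot had already completed, 
and the caller would be the one responsible for discarding the result.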



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
