Jinzhong Li created FLINK-35897:
-----------------------------------

Summary: Some checkpoint files and localState files can't be cleanUp when checkpoint is aborted
Key: FLINK-35897
URL: https://issues.apache.org/jira/browse/FLINK-35897
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing, Runtime / State Backends
Reporter: Jinzhong Li

h2. Problem

When a job checkpoint is canceled ([AsyncSnapshotCallable.java#L129|https://github.com/apache/flink/blob/d4294c59e6f2ec8702f53916ea49cf23f6db8961/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L129]), the asynchronous snapshot thread can still continue executing and produce a completed checkpoint ([RocksIncrementalSnapshotStrategy.java#L324|https://github.com/apache/flink/blob/d4294c59e6f2ec8702f53916ea49cf23f6db8961/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L324]). In this case, nothing is responsible for cleaning up the completed checkpoint: neither the asynchronous snapshot thread nor SubtaskCheckpointCoordinatorImpl does it, so the checkpoint files and local state files are left behind.

h3. How to reproduce it

The issue can be reproduced by running the [DataGenWordCount example in my debug branch|https://github.com/ljz2051/flink/commit/33c0c55098a49a0b56c9404256a560da5069f26c], to which I have added some debug code.

h3. How to fix it

When the asynchronous snapshot thread completes a checkpoint, it needs to clean up the completed checkpoint if it finds that the checkpoint has been canceled.
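
The idea under "How to fix it" could look roughly like the sketch below. It is a minimal, standalone illustration and not Flink code: the class {{CancellableSnapshot}} and its methods {{uploadFiles}}, {{discard}}, and {{cancel}} are hypothetical names invented for this example, and the "files" are plain strings. The point is only that the snapshot thread re-checks the cancellation flag after it has produced a result and discards that result itself if the checkpoint was canceled in the meantime.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for an asynchronous snapshot task; not Flink's actual class. */
class CancellableSnapshot implements Callable<List<String>> {

    // Set by the canceling thread, e.g. when the checkpoint is aborted.
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    // Records what was discarded so the behavior can be observed in this sketch.
    final List<String> discarded = Collections.synchronizedList(new ArrayList<>());

    void cancel() {
        cancelled.set(true);
    }

    // Hypothetical: pretends to write checkpoint files and returns their names.
    protected List<String> uploadFiles() {
        return List.of("sst-1", "sst-2");
    }

    // Hypothetical: stands in for deleting remote and local state files.
    protected void discard(List<String> files) {
        discarded.addAll(files);
    }

    @Override
    public List<String> call() {
        List<String> result = uploadFiles();
        // The snapshot finished, but the checkpoint may have been canceled while
        // it ran. In that case this thread owns the cleanup of its own result.
        if (cancelled.get()) {
            discard(result);
            throw new CancellationException("snapshot canceled; completed result discarded");
        }
        return result;
    }
}
```

In this sketch, calling {{cancel()}} before or during {{call()}} leads to the files being passed to {{discard}} and a {{CancellationException}} being thrown, while an uncanceled run returns the file list untouched. A cancellation that arrives after the check but before the result is consumed is not covered here; an actual fix would also need to decide who discards the result in that window.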