[ https://issues.apache.org/jira/browse/FLINK-38408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023144#comment-18023144 ]
Rui Fan edited comment on FLINK-38408 at 9/26/25 3:14 PM: ---------------------------------------------------------- h2. Root Cause Analysis h3. Problem Location Log analysis revealed that the checkpoint had actually completed successfully: {code:java} 07:19:37,522 [jobmanager-io-thread-1] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1 for job b809cf46d67c23697786fd514565c737 (4464 bytes, checkpointDuration=45 ms, finalizationTime=4 ms){code} However, the test code could not find the completed checkpoint when calling {{{}CommonTestUtils.getLatestCompletedCheckpointPath(){}}}. h3. Root Cause The problem occurs in the execution order of the {{CheckpointCoordinator.completePendingCheckpoint()}} method: [https://github.com/apache/flink/blob/39a46288c7e74d7c5c799b48ef5a42f0c47dcaad/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1389] {code:java} pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint); reportCompletedCheckpoint(completedCheckpoint); {code} *Checkpoint Coordinator mechanism:* # {*}A{*}: {{pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint)}} completes the completion future first{{{{}}{}}} # {*}B{*}: {{reportCompletedCheckpoint(completedCheckpoint)}} updates checkpoint statistics. Test code timeline: # *C:* Detect future completion # *D:* Call {{getLatestCompletedCheckpointPath() immediately}} Usually, the execution sequence is A -> B -> C -> D, it works well. The bug happens if execution sequence is A > C -> D -> B. h2. Reproduction Method In the {{completePendingCheckpoint()}} method, inserting {{Thread.sleep(100)}} between {{complete()}} and {{reportCompletedCheckpoint()}} can reproduce this issue 100%. h1. Solution: Adjust the execution order in CheckpointCoordinator *Changes:* {code:java} // Update statistics first reportCompletedCheckpoint(completedCheckpoint); // Complete the future later pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint);{code} *Benefits:* * Fundamentally eliminates race conditions * Ensures semantic correctness: Waiting parties are notified only when the checkpoint is fully processed was (Author: fanrui): h2. Root Cause Analysis h3. Problem Location Log analysis revealed that the checkpoint had actually completed successfully: {{}} {code:java} {code} {{07:19:37,522 [jobmanager-io-thread-1] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1 for job b809cf46d67c23697786fd514565c737 (4464 bytes, checkpointDuration=45 ms, finalizationTime=4 ms). }} However, the test code could not find the completed checkpoint when calling {{{}CommonTestUtils.getLatestCompletedCheckpointPath(){}}}. h3. Root Cause The problem occurs in the execution order of the {{CheckpointCoordinator.completePendingCheckpoint()}} method: [https://github.com/apache/flink/blob/39a46288c7e74d7c5c799b48ef5a42f0c47dcaad/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1389] {code:java} pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint); reportCompletedCheckpoint(completedCheckpoint); {code} *Checkpoint Coordinator mechanism:* # {*}A{*}: {{pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint)}} completes the completion future first{{{}{}}} # {*}B{*}: {{reportCompletedCheckpoint(completedCheckpoint)}} updates checkpoint statistics. Test code timeline: # *C:* Detect future completion # *D:* Call {{getLatestCompletedCheckpointPath() immediately}} {{Usually, the execution sequence is A -> B -> C -> D, it works well. }} {{The bug happens if }}{{execution sequence is A }}{{-> C -> D }}{{-> B.}} h2. Reproduction Method In the {{completePendingCheckpoint()}} method, inserting {{Thread.sleep(100)}} between {{complete()}} and {{reportCompletedCheckpoint()}} can reproduce this issue 100%. h1. Solution: Adjust the execution order in CheckpointCoordinator *Changes:* {{}} {code:java} {code} {{// Update statistics first reportCompletedCheckpoint(completedCheckpoint); }} {{// Complete the future later pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint); }} *Benefits:* * Fundamentally eliminates race conditions * Ensures semantic correctness: Waiting parties are notified only when the checkpoint is fully processed > MapStateNullValueCheckpointingITCase failed in test_cron_azure tests > -------------------------------------------------------------------- > > Key: FLINK-38408 > URL: https://issues.apache.org/jira/browse/FLINK-38408 > Project: Flink > Issue Type: Bug > Components: Tests > Affects Versions: 2.2.0 > Reporter: Ruan Hang > Assignee: Rui Fan > Priority: Major > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=69803&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=115e5c38-6efb-5006-4921-5e2851da71ef -- This message was sent by Atlassian Jira (v8.20.10#820010)