[ 
https://issues.apache.org/jira/browse/FLINK-38408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023144#comment-18023144
 ] 

Rui Fan edited comment on FLINK-38408 at 9/26/25 3:14 PM:
----------------------------------------------------------

h2. Root Cause Analysis
h3. Problem Location

Log analysis revealed that the checkpoint had actually completed successfully:
{code:java}
07:19:37,522 [jobmanager-io-thread-1] INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed 
checkpoint 1 for job b809cf46d67c23697786fd514565c737 (4464 bytes, 
checkpointDuration=45 ms, finalizationTime=4 ms){code}
However, the test code could not find the completed checkpoint when calling 
{{CommonTestUtils.getLatestCompletedCheckpointPath()}}.
h3. Root Cause

The problem lies in the order of two calls in the 
{{CheckpointCoordinator.completePendingCheckpoint()}} method:

[https://github.com/apache/flink/blob/39a46288c7e74d7c5c799b48ef5a42f0c47dcaad/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1389]
{code:java}
pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint);
reportCompletedCheckpoint(completedCheckpoint); {code}
 

*Checkpoint Coordinator mechanism:*
 # *A:* {{pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint)}} 
completes the completion future first.
 # *B:* {{reportCompletedCheckpoint(completedCheckpoint)}} then updates the 
checkpoint statistics.

*Test code timeline:*
 # *C:* Detect future completion
 # *D:* Call {{getLatestCompletedCheckpointPath()}} immediately

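Usually the execution sequence is A -> B -> C -> D, and the test passes.

The bug occurs when the sequence is A -> C -> D -> B: the test reads the checkpoint statistics at D before B has updated them, so it finds no completed checkpoint.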
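
The failing interleaving can be modeled outside Flink. The following is a minimal sketch, not Flink code: {{CheckpointRaceSketch}}, {{reportedPath}} (a stand-in for the checkpoint statistics) and the checkpoint path are all hypothetical, and a latch is used only to force the order A -> C -> D -> B deterministically.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the ordering problem. None of these names are Flink classes.
class CheckpointRaceSketch {

    /** Forces the order A -> C -> D -> B and returns what step D observed. */
    static String readPathRightAfterFuture() {
        // Hypothetical stand-in for the checkpoint statistics read by the test helper.
        AtomicReference<String> reportedPath = new AtomicReference<>();
        CompletableFuture<String> completionFuture = new CompletableFuture<>();
        CountDownLatch readDone = new CountDownLatch(1);
        String path = "file:///tmp/chk-1"; // hypothetical checkpoint path

        Thread coordinator = new Thread(() -> {
            completionFuture.complete(path); // A: waiters are released here
            awaitQuietly(readDone);          // hold B back until D has run
            reportedPath.set(path);          // B: statistics are updated too late
        });
        coordinator.start();

        completionFuture.join();              // C: the test sees the future complete
        String observed = reportedPath.get(); // D: the test reads the statistics immediately
        readDone.countDown();
        joinQuietly(coordinator);
        return observed;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void joinQuietly(Thread thread) {
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Prints "path seen by step D: null"
        System.out.println("path seen by step D: " + readPathRightAfterFuture());
    }
}
```

In this model step D returns {{null}} even though the "checkpoint" has completed, which mirrors the symptom seen in the test.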
h2. Reproduction Method

Inserting {{Thread.sleep(100)}} between {{complete()}} and 
{{reportCompletedCheckpoint()}} in the {{completePendingCheckpoint()}} method 
reproduces the issue every time.
h2. Solution: Adjust the execution order in CheckpointCoordinator

*Changes:*
{code:java}
// Update statistics first 
reportCompletedCheckpoint(completedCheckpoint);
// Complete the future later 
pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint);{code}
*Benefits:*
 * Eliminates this race condition at its source
 * Ensures semantic correctness: Waiting parties are notified only when the 
checkpoint is fully processed
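
Why the reordering is sufficient can be checked with a small model. This is a sketch under stated assumptions rather than the actual patch: {{CheckpointOrderFixSketch}}, {{reportedPath}} and the paths are hypothetical. It relies on the guarantee that actions taken before completing a {{CompletableFuture}} are visible to a thread that has observed the completion.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the reordered calls. None of these names are Flink classes.
class CheckpointOrderFixSketch {

    /** Runs B before A for the given number of rounds and counts rounds where step D saw no path. */
    static int countMissingPaths(int rounds) {
        int missing = 0;
        for (int i = 0; i < rounds; i++) {
            // Hypothetical stand-in for the checkpoint statistics.
            AtomicReference<String> reportedPath = new AtomicReference<>();
            CompletableFuture<String> completionFuture = new CompletableFuture<>();
            String path = "file:///tmp/chk-" + i; // hypothetical checkpoint path

            Thread coordinator = new Thread(() -> {
                reportedPath.set(path);          // B: update statistics first
                completionFuture.complete(path); // A: complete the future later
            });
            coordinator.start();

            completionFuture.join();                // C: the test sees the future complete
            if (!path.equals(reportedPath.get())) { // D: the test reads the statistics immediately
                missing++;
            }
            try {
                coordinator.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Prints "rounds with a missing path: 0"
        System.out.println("rounds with a missing path: " + countMissingPaths(1000));
    }
}
```

With B ahead of A, no interleaving of C and D can observe the future as complete while the statistics are still empty.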



> MapStateNullValueCheckpointingITCase failed in test_cron_azure tests
> --------------------------------------------------------------------
>
>                 Key: FLINK-38408
>                 URL: https://issues.apache.org/jira/browse/FLINK-38408
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.2.0
>            Reporter: Ruan Hang
>            Assignee: Rui Fan
>            Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=69803&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=115e5c38-6efb-5006-4921-5e2851da71ef



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
