Rafael Zimmermann created FLINK-37393:
-----------------------------------------

             Summary: Abnormally Long Checkpoint Duration After Full Checkpoint 
Completion
                 Key: FLINK-37393
                 URL: https://issues.apache.org/jira/browse/FLINK-37393
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.18.0
         Environment: Apache Flink 1.18.0
Pipeline processing ~5.7TB of checkpointed data
Using GCS for checkpoint storage
            Reporter: Rafael Zimmermann
         Attachments: Screenshot 2025-02-03 at 1.52.09 PM.png, Screenshot 
2025-02-03 at 1.52.23 PM.png, Screenshot 2025-02-03 at 1.52.30 PM.png, 
evidence.log

We're observing an issue where the checkpoint immediately following a full 
checkpoint takes unusually long to complete (1-2 hours) in our Flink pipeline, 
while normal checkpoints complete within seconds to minutes.



### Observed Behavior:

- After a full checkpoint completes, the next incremental checkpoint shows an 
extremely high start delay
- Normal checkpoints complete in ~30 seconds
- Full checkpoints complete in ~7 minutes
- The first checkpoint after a full checkpoint takes 1-2 hours
- The start delay appears to be the main contributor to the long duration

### Logs and Evidence:

Full checkpoint logs showing significant gaps:


{code:java}
{"instant":{"epochSecond":1738600670,"nanoOfSecond":603000000},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}
...
{"instant":{"epochSecond":1738600713,"nanoOfSecond":266000000},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}
{code}
 
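As a small sanity check (epoch values copied from the two log lines above), the 42663 ms "time in queue" reported by CheckpointRequestDecider is exactly the gap between the two timestamps:

{code:java}
// Sanity check: the queue time logged by CheckpointRequestDecider (42663)
// equals the gap, in milliseconds, between the two log timestamps above.
public class CheckpointQueueDelay {
    public static void main(String[] args) {
        long triggeredMs = 1738600670L * 1000 + 603; // AdaptiveScheduler "Triggering a checkpoint"
        long dequeuedMs  = 1738600713L * 1000 + 266; // CheckpointRequestDecider log line
        System.out.println((dequeuedMs - triggeredMs) + " ms in queue"); // prints: 42663 ms in queue
    }
}
{code}

This particular gap is only ~43 seconds, so it illustrates the queuing behavior rather than the full 1-2 hour delay; the hour-scale delays show up as start delay on the subsequent checkpoint.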

### Impact:

- Affects pipeline reliability
- No data loss observed, but creates operational concerns



### Potential Causes:

- Possible locking mechanism bug in Flink internals
- Issues with queued checkpoint requests
- Interaction between full and incremental checkpoint scheduling

### Attempted Workarounds:

- Manually triggering checkpoints
- Waiting it out: normal checkpoint operation resumes on its own after several 
hours

### Questions:

- Is this expected behavior when full and incremental checkpoints interact?
- Are there known issues with checkpoint request queuing?
- Are there configuration parameters that could help mitigate this?
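On the last question, these are the checkpoint-related settings we are aware of from the Flink 1.18 configuration reference; the values below are illustrative only (not our production config), and it is unclear whether any of them affect the queued-request path described above:

{code}
# Checkpoint knobs from the Flink 1.18 docs (illustrative values only):
execution.checkpointing.interval: 10min
execution.checkpointing.min-pause: 5min
execution.checkpointing.max-concurrent-checkpoints: 1
execution.checkpointing.timeout: 30min
{code}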

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
