Rafael Zimmermann created FLINK-37393:
-----------------------------------------
Summary: Abnormally Long Checkpoint Duration After Full Checkpoint Completion
Key: FLINK-37393
URL: https://issues.apache.org/jira/browse/FLINK-37393
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.18.0
Environment: Apache Flink 1.18.0
Pipeline processing ~5.7TB of checkpointed data
Using GCS for checkpoint storage
Reporter: Rafael Zimmermann
Attachments: Screenshot 2025-02-03 at 1.52.09 PM.png,
Screenshot 2025-02-03 at 1.52.23 PM.png,
Screenshot 2025-02-03 at 1.52.30 PM.png,
evidence.log
We're observing an issue in our Flink pipeline where the checkpoint that
follows a full checkpoint takes an unusually long time to complete (1-2 hours),
while normal checkpoints typically complete within seconds to minutes.
### Observed Behavior:
- After a full checkpoint completes, the next incremental checkpoint shows
extremely high start delay
- Normal checkpoints take ~30 seconds to complete
- Full checkpoints take ~7 minutes to complete
- The problematic checkpoint after full checkpoint takes 1-2 hours
- The start delay appears to be the main contributing factor to the long
duration
### Logs and Evidence:
Full checkpoint logs showing significant gaps:
{code:java}
{"instant":{"epochSecond":1738600670,"nanoOfSecond":603000000},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}
...
{"instant":{"epochSecond":1738600713,"nanoOfSecond":266000000},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}
{code}
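As a cross-check, the gap between the two log entries above can be computed from their `instant` fields. The sketch below is illustrative only: the records are trimmed to the fields it uses, the helper names are hypothetical, and it assumes the "request time in queue" value is reported in milliseconds.

```python
import json

# The two records from the excerpt above, trimmed to the fields used here.
LOG_LINES = [
    '{"instant":{"epochSecond":1738600670,"nanoOfSecond":603000000},'
    '"message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}',
    '{"instant":{"epochSecond":1738600713,"nanoOfSecond":266000000},'
    '"message":"checkpoint request time in queue: 42663"}',
]


def instant_to_millis(record):
    """Convert a record's epochSecond/nanoOfSecond pair to epoch milliseconds."""
    instant = record["instant"]
    return instant["epochSecond"] * 1000 + instant["nanoOfSecond"] // 1_000_000


def gap_millis(first_line, second_line):
    """Elapsed milliseconds between two JSON log lines."""
    first = instant_to_millis(json.loads(first_line))
    second = instant_to_millis(json.loads(second_line))
    return second - first


gap = gap_millis(*LOG_LINES)

# Assumption: the trailing number in the message is a duration in milliseconds.
reported = int(json.loads(LOG_LINES[1])["message"].rsplit(": ", 1)[1])

print(f"gap between entries: {gap} ms, reported queue time: {reported} ms")
```

Under that assumption, the timestamp difference and the logged queue time agree, which suggests the request sat in the queue for the whole interval between the trigger and the decider's log entry.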
### Impact:
- Affects pipeline reliability
- No data loss observed, but creates operational concerns
### Potential Causes:
- Possible locking mechanism bug in Flink internals
- Issues with queued checkpoint requests
- Interaction between full and incremental checkpoint scheduling
### Attempted Workarounds:
- Manually triggering checkpoints
- Normal checkpoint operations resume after several hours
### Questions:
- Is this expected behavior when full and incremental checkpoints interact?
- Are there known issues with checkpoint request queuing?
- Are there configuration parameters that could help mitigate this?